Abstract

In most papers establishing consistency for learning algorithms it is assumed that the obser- 
vations used for training are realizations of an i.i.d. process. In this paper we go far beyond this 
classical framework by showing that support vector machines (SVMs) essentially only require 
that the data-generating process satisfies a certain law of large numbers. We then consider the 
learnability of SVMs for $\alpha$-mixing (not necessarily stationary) processes for both classification
and regression, where for the latter we explicitly allow unbounded noise. 

Keywords: Support vector machine, Consistency, Non-stationary mixing process,
Classification, Regression

1 Introduction 

In recent years Support Vector Machines (SVMs) have become one of the most widely used al- 
gorithms for classification and regression problems. Besides their good performance in practical 
applications they also enjoy a good theoretical justification in terms of both universal consistency 
(see [1, 2, 3, 4]) and learning rates (see [5, 6, 7, 8, 9]) if the training samples come from an i.i.d. process.
However, this i.i.d. assumption often cannot be strictly justified in real-world problems. For
example, many machine learning applications such as market prediction, system diagnosis, and 
speech recognition are inherently temporal in nature, and consequently not i.i.d. processes. More- 
over, samples are often gathered from different sources and hence it seems unlikely that they are 
identically distributed. Although SVMs have no theoretical justification in such non-i.i.d. scenarios 
they are often applied successfully. One of the goals of this work is to explain this success by
establishing consistency results for SVMs under somewhat minimal assumptions on the data-generating
process. Namely, we show that for any data-generating process that satisfies certain laws of large 
numbers there exists a sequence of regularization parameters such that the corresponding SVM is 
consistent. By general negative results (see [10]) on universal consistency for stationary ergodic
processes this sequence of regularization parameters must depend on the stochastic properties of 
the data-generating process and cannot be adaptively chosen. However, we show that if the process
satisfies certain mixing properties such as polynomially decaying $\alpha$-mixing coefficients (see the
definitions in the following sections) then a suitable regularization sequence can be chosen a priori.
In addition, a side-effect of our analysis is that it provides consistency for SVMs using Gaussian 
kernels even if the common compactness assumption of the input space is violated. Consequently, 
our consistency results for $\alpha$-mixing processes generalize earlier consistency results of [1, 2, 3] with
respect to both the compactness assumption on X and the i.i.d. assumption on the data-generating 
process. 

Relaxations of the independence assumption have been considered for quite a while in both the 
machine learning and the statistical literature. For example PAC-learning for stationary $\beta$-mixing
processes has been investigated in and more recently, consistency of regularized boosting for 
classification was established for such processes. For a larger class of processes, namely $\alpha$-mixing
but not necessarily stationary processes, consistency of kernel density estimators was shown in [12].
For bounded, stationary processes with exponentially decaying $\alpha$-mixing coefficients a consistent
method for one-step-ahead prediction (also known as "static autoregressive forecasting", see [13])
was presented in [14]. Moreover, for this prediction problem [15] establishes consistency for a certain
structural risk minimization approach under the assumption that the process is stationary and has
polynomially decaying $\beta$-mixing rates. For further results and references we refer to [16, 17].

Relaxations of the stationarity of the process are less common. In fact, to the best of our knowledge
[12] is the only work which deals with such processes. One of the reasons for this lack of literature
may be the fact that for non-identically distributed observations there is no obvious way to define
a reasonable risk functional which resembles the idea of "average future error". On the other hand,
it seems obvious that learning methods based on a modified empirical risk minimization procedure 
require at least that the process satisfies certain laws of large numbers. Interestingly, we will 
show that for processes satisfying such laws of large numbers there is always a "limit" distribution 
which can be used to define a reasonable risk functional. Moreover, for many interesting classes 
of processes the existence of such a limit distribution turns out to be equivalent to a law of large 
numbers. 

The rest of this work is organized as follows: In Section 2 we will define the notions "laws of
large numbers" and "limit" distributions for stochastic processes. We then discuss the relation- 
ship between these concepts and consider specific classes of stochastic processes that satisfy these 
definitions. We then recall some basic classes of loss functions and define consistency of learning 
algorithms for stochastic processes satisfying certain laws of large numbers. Finally, we show that 
SVMs can be made consistent for such processes. In Section 3 we then recall various mixing
coefficients for stochastic processes. These coefficients are then used to establish consistency results for
SVMs with a priori chosen regularization sequences. Finally, the proofs of our results can be found
in Section 4.

2 Consistency for Processes satisfying a Law of Large Numbers 

The aim of this section is to show that SVMs can be made consistent whenever the data-generating 
process satisfies a certain type of law of large numbers (LLNs). To this end we first recall some 
notions for stochastic processes and introduce these laws of large numbers in Subsection 2.1. Some
examples of processes satisfying LLNs are then presented in Subsection 2.2. In Subsection 2.3
we then recall some important notions for loss functions and risks. We also define consistency of
learning algorithms for data-generating processes that satisfy a law of large numbers. Finally, we
present and discuss our consistency results for SVMs in Subsection 2.4.

2.1 Law of Large Numbers for Stochastic Processes

In this subsection we mainly introduce laws of large numbers for general, not necessarily station- 
ary stochastic processes. The concepts we will present seem to be quite natural and elementary, 
and therefore one would expect that they have already been introduced elsewhere. Surprisingly, 
however, we were not able to find any exposition that covers major parts of the material of this 
section, and thus we discuss the following notions in some detail. 

Let us begin with some notations. Given a measurable space $Z$ we write $\mathcal{L}_0(Z)$ for the set of
all measurable functions $f : Z \to \mathbb{R}$, and $\mathcal{L}_\infty(Z)$ for the set of all bounded measurable functions
$f : Z \to \mathbb{R}$. Moreover, for a set $B \subset Z$ we write $\mathbf{1}_B$ for its indicator function, i.e. $\mathbf{1}_B : Z \to \{0,1\}$
with $\mathbf{1}_B(z) = 1$ if and only if $z \in B$. Let us now assume that we also have a probability space
$(\Omega, \mathcal{A}, \mu)$ and a measurable map $T : \Omega \to Z$. Then $\sigma(T)$ denotes the smallest $\sigma$-algebra on $\Omega$
for which $T$ is measurable. Moreover, $\mu_T$ denotes the $T$-image measure of $\mu$, which is defined by
$\mu_T(B) := \mu(T^{-1}(B))$, $B \subset Z$ measurable.

Again, let $(\Omega, \mathcal{A}, \mu)$ be a probability space and $(Z, \mathcal{B})$ be a measurable space. Recall that for
a stochastic process $\mathcal{Z} := (Z_i)_{i \ge 1}$, i.e. a sequence of measurable maps $Z_i : \Omega \to Z$, $i \ge 1$, the
map $\mathcal{Z} : \Omega \to Z^{\mathbb{N}}$ defined by $\omega \mapsto (Z_i(\omega))_{i \ge 1}$ is $(\mathcal{A}, \mathcal{B}^{\mathbb{N}})$-measurable. Consequently, $\mathcal{Z}$ has an image
measure $\mu_{\mathcal{Z}}$ which is given by $\mu_{\mathcal{Z}}(B) := \mu(\mathcal{Z}^{-1}(B))$ for all $B \in \mathcal{B}^{\mathbb{N}}$.

Furthermore, recall that $\mathcal{Z}$ is called identically distributed if $\mu_{Z_i} = \mu_{Z_j}$ for all $i, j \ge 1$, and
stationary in the wide sense if $\mu_{(Z_{i_1+i}, Z_{i_2+i})} = \mu_{(Z_{i_1}, Z_{i_2})}$ for all $i_1, i_2, i \ge 1$. Moreover, $\mathcal{Z}$ is said to
be stationary if $\mu_{(Z_{i_1+i}, \dots, Z_{i_n+i})} = \mu_{(Z_{i_1}, \dots, Z_{i_n})}$ for all $n, i, i_1, \dots, i_n \ge 1$.

As we will see later we are not interested in the data-generating process $\mathcal{Z} := (Z_i)_{i \ge 1}$ itself, but only
in processes of the form $g \circ \mathcal{Z} := (g \circ Z_i)_{i \ge 1}$ for $g : Z \to Z'$ measurable. In the following we call
$g \circ \mathcal{Z}$ an image of the process $\mathcal{Z}$, and $\mathcal{Z}$ itself a hidden process. The following definition introduces
laws of large numbers for stochastic processes by considering real-valued image processes:

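Definition 2.1 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$
be a $Z$-valued stochastic process on $\Omega$. We say that $\mathcal{Z}$ satisfies the weak law of large numbers for
events (WLLNE) if for all measurable $B \subset Z$ there exists a constant $c_B \in \mathbb{R}$ such that for all $\varepsilon > 0$
we have

$$\lim_{n \to \infty} \mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) - c_B\Bigr| > \varepsilon\Bigr\}\Bigr) = 0. \tag{1}$$

Moreover, we say that $\mathcal{Z}$ satisfies the strong law of large numbers for events (SLLNE) if for all
measurable $B \subset Z$ there exists a constant $c_B \in \mathbb{R}$ with

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) = c_B \tag{2}$$

for $\mu$-almost all $\omega \in \Omega$.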

It is obvious that $\mathcal{Z}$ satisfies the WLLNE if and only if the sequences $\bigl(\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B \circ Z_i\bigr)_{n \ge 1}$ converge
in probability $\mu$ to a constant for all measurable $B \subset Z$. Consequently, the SLLNE implies the WLLNE but
in general the converse implication does not hold. Moreover, if $\mathcal{Z}$ satisfies the WLLNE then the
constants $c_B$ in (1) must obviously satisfy $c_B \in [0, 1]$ for all measurable $B \subset Z$. Finally, if $\mathcal{Z}$
satisfies the WLLNE or SLLNE then it is a trivial exercise to check that every image $g \circ \mathcal{Z}$ also
satisfies the WLLNE or SLLNE, respectively.
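
The following minimal sketch illustrates the WLLNE numerically. It is our own toy illustration rather than part of the original development: the process, the function name `empirical_frequency`, and the probabilities 0.2 and 0.8 are hypothetical choices. The process is independent but not identically distributed, and the relative frequency of the event $B = \{1\}$ nevertheless concentrates around the constant $c_B = 0.5$.

```python
import random


def empirical_frequency(n, seed=0):
    """Return (1/n) * sum_{i=1}^n 1_B(Z_i) for the event B = {1}.

    Hypothetical toy process: the Z_i are independent Bernoulli variables
    whose success probability alternates between 0.2 (odd i) and 0.8
    (even i), so the process is independent but not identically distributed.
    """
    rng = random.Random(seed)
    hits = 0
    for i in range(1, n + 1):
        p_i = 0.2 if i % 2 == 1 else 0.8
        if rng.random() < p_i:
            hits += 1
    return hits / n
```

Calling `empirical_frequency(n)` for increasing `n` should return values that settle near 0.5, although no single $Z_i$ takes the value 1 with probability 0.5.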

It is well known that i.i.d. processes generated by $P$ satisfy the $P^{\infty}$-SLLNE with $c_B = P(B)$
for all measurable $B \subset Z$, but these processes are by far not the only ones (see Subsection 2.2
for some other examples). For the following development it is instructive to observe that for
i.i.d. processes the map $B \mapsto c_B$ defines a probability measure on $Z$. Our next goal is to show
that this remains true for general processes satisfying a WLLNE. To this end we first consider the
averages $\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B \circ Z_i$ of the probabilities of the event $B$:

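Definition 2.2 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be
a $Z$-valued stochastic process on $\Omega$. We say that $\mathcal{Z}$ is asymptotically mean stationary (AMS) if

$$P(B) := \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B \circ Z_i \tag{3}$$

exists for all measurable $B \subset Z$.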

The notion "asymptotically mean stationary" was first introduced for dynamical systems by Gray
and Kieffer in [18]. We are unaware of any work that introduces this notion for general stochastic
processes, though a similar idea already appears as assumption (SI) in [12].

Using the simple formula $\mathbf{1}_B \circ g = \mathbf{1}_{g^{-1}(B)}$ it is obvious that every image $g \circ \mathcal{Z}$ of an AMS process
$\mathcal{Z}$ is again AMS. Moreover, identically distributed processes, and hence in particular stationary processes, are obviously
AMS. For such processes we also have $P(B) = \mu_{Z_1}(B)$ for all measurable $B \subset Z$, and
consequently, $P$ defines a probability measure on $Z$. The following lemma whose proof can be
found in Section 4 shows that the latter observation remains true for general AMS processes.

Lemma 2.3 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be a
$Z$-valued stochastic process on $\Omega$ which is AMS. Then $P$ defined by (3) is a probability measure on
$Z$. We call $P$ the stationary mean of $(\mathcal{Z}, \mu)$.
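
As a small worked example (our own illustration, with hypothetical parameters $p, q \in [0,1]$), consider a $\{0,1\}$-valued process whose marginals satisfy $\mu(Z_i = 1) = p$ for odd $i$ and $\mu(Z_i = 1) = q$ for even $i$. Only the marginal distributions enter the averages, so no independence assumption is needed here:

```latex
\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_{\{1\}} \circ Z_i
  = \frac{\lceil n/2 \rceil \, p + \lfloor n/2 \rfloor \, q}{n}
  \;\longrightarrow\; \frac{p+q}{2}
  \qquad (n \to \infty).
```

Hence this process is AMS and its stationary mean is the Bernoulli distribution with parameter $(p+q)/2$, which for $p \neq q$ is not the distribution of any single $Z_i$.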

It is well known that not every stationary process satisfies a (weak or strong) law of large numbers
for events. Consequently, we see that in general AMS processes do not satisfy a law of large
numbers. However, the following theorem proved in Section 4 shows that the converse implication
is true. In addition, it shows that the constants in (1) define the stationary mean distribution:

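Theorem 2.4 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be a
$Z$-valued stochastic process on $\Omega$ satisfying the WLLNE. Then $\mathcal{Z}$ is AMS and the stationary mean
$P$ of $(\mathcal{Z}, \mu)$ satisfies

$$\lim_{n \to \infty} \mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) - P(B)\Bigr| > \varepsilon\Bigr\}\Bigr) = 0 \tag{4}$$

for all measurable $B \subset Z$ and all $\varepsilon > 0$. Moreover, if $\mathcal{Z}$ satisfies the SLLNE then

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) = P(B)$$

holds for $\mu$-almost all $\omega \in \Omega$.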

Equation (4) shows that the stationary mean $P$ describes with high probability our average
observations from $\mathcal{Z}$. Given a loss function $L$ (see Subsection 2.3 for definitions) it seems therefore
natural to approximate the empirical $L$-risk of a function by the corresponding $L$-risk defined by
$P$.^1 However, in order to make this ansatz rigorous we have to extend $P$ to function classes larger
than the set of indicator functions. We begin with the following result that shows that a law of
large numbers for events implies a corresponding law of large numbers for bounded functions:

^1 For i.i.d. observations one typically argues the other way around. However, for general stochastic processes the
learning goal should be to minimize the future average loss. This loss is an empirical $L$-risk which can be approximated
by the $L$-risk defined by $P$. In the training phase of empirical risk minimizers the latter $L$-risk is then approximated
by the empirical $L$-risk of the already observed training samples. In this way $P$ and the corresponding convergence
rates in (3) and (4) tell us how well we can generalize from the past to the future.



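Lemma 2.5 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be
a $Z$-valued stochastic process on $\Omega$ satisfying the WLLNE. Furthermore, let $P$ be the asymptotic
mean of $(\mathcal{Z}, \mu)$. Then for all $f \in \mathcal{L}_\infty(Z)$ we have

$$\mathbb{E}_P f = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} f \circ Z_i \tag{5}$$

in probability $\mu$ and

$$\mathbb{E}_P f = \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\mu f \circ Z_i. \tag{6}$$

Moreover, if $\mathcal{Z}$ actually satisfies the SLLNE then the convergence in (5) holds $\mu$-almost surely.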

For classification problems we usually can restrict our considerations to bounded functions, and 
hence Lemma 2.5 is all that we need. However, for regression problems with unbounded noise we
have to consider integrable functions, instead. The following definition serves this purpose: 

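Definition 2.6 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be
a $Z$-valued stochastic process on $\Omega$. Assume that $\mathcal{Z}$ is AMS and let $P$ be the asymptotic mean of
$(\mathcal{Z}, \mu)$. We say that $\mathcal{Z}$ satisfies the weak law of large numbers (WLLN) if for all $f \in \mathcal{L}_1(P)$ and
all $\varepsilon > 0$ we have

$$\lim_{n \to \infty} \mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n}\sum_{i=1}^{n} f \circ Z_i(\omega) - \mathbb{E}_P f\Bigr| > \varepsilon\Bigr\}\Bigr) = 0. \tag{7}$$

Moreover, we say that $\mathcal{Z}$ satisfies the strong law of large numbers (SLLN) if for all $f \in \mathcal{L}_1(P)$ we
have

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} f \circ Z_i(\omega) = \mathbb{E}_P f \tag{8}$$

for $\mu$-almost all $\omega \in \Omega$.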



2.2 Examples of Processes Satisfying a Law of Large Numbers 

In this subsection we recall several examples of stochastic processes satisfying a law of large numbers. 
In particular, we consider independent processes, dynamical systems, and Markov chains. 



2.2.1 Uncorrelated and independent processes 

Recall that two real-valued random variables $\xi$ and $\eta$ are called uncorrelated if they satisfy $\mathbb{E}\xi\eta =
\mathbb{E}\xi\,\mathbb{E}\eta$. The following proposition proved in Section 4 shows that AMS, mutually uncorrelated
processes satisfy a WLLNE:

Proposition 2.7 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$
be a $Z$-valued stochastic process on $\Omega$. Assume that the random variables $\mathbf{1}_B \circ Z_i$ and $\mathbf{1}_B \circ Z_j$ are
uncorrelated for all measurable $B \subset Z$ and all $i, j \ge 1$ with $i \ne j$. Then the following statements
are equivalent:

i) $\mathcal{Z}$ is AMS.

ii) $\mathcal{Z}$ satisfies the WLLNE.



Considering the proof of the above proposition it is immediately clear that the proposition remains
true if the process is not uncorrelated but only satisfies

$$\lim_{n \to \infty} \mathbb{E}_\mu \Bigl(\frac{1}{n}\sum_{i=1}^{n} \bigl(\mathbf{1}_B \circ Z_i - \mathbb{E}_\mu \mathbf{1}_B \circ Z_i\bigr)\Bigr)^2 = 0 \tag{9}$$

for all measurable $B \subset Z$. Processes satisfying such a weaker assumption are introduced and
discussed in Subsection 3.1.

It is obvious that Proposition 2.7 holds for processes for which the image processes $(\mathbf{1}_B \circ Z_i)_{i \ge 1}$
are independent. However, by applying [19, Theorem 2.7.1] we have the following stronger result:

Proposition 2.8 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$
be a $Z$-valued stochastic process on $\Omega$. Assume that $\mathbf{1}_B \circ Z_1, \mathbf{1}_B \circ Z_2, \dots$ are independent for all
fixed measurable $B \subset Z$. Then the following statements are equivalent:

i) $\mathcal{Z}$ is AMS.

ii) $\mathcal{Z}$ satisfies the SLLNE.

Note that the independence assumption in Proposition 2.8 is weaker than assuming that the process
is independent.

By Kolmogorov's well-known strong law of large numbers it is obvious that every process $\mathcal{Z}$
whose $\mathbb{R}$-valued images $g \circ \mathcal{Z}$ are i.i.d. processes satisfies a SLLN. Moreover, a result by Etemadi
[20] shows that the independence assumption can be relaxed to pairwise independence. Finally, the
following result whose proof can again be found in Section 4 generalizes Kolmogorov's law of large
numbers to a certain type of martingale:

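Proposition 2.9 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$
be a $Z$-valued stochastic process on $\Omega$. Assume that for all $f \in \mathcal{L}_1(\mu_{Z_1})$ and $\mathcal{F}_n := \sigma(f \circ Z_i : i \ge n)$,
$n \ge 1$, we have $\bigcap_{n \ge 1} \mathcal{F}_n = \{\emptyset, \Omega\}$ and

$$\mathbb{E}_\mu\Bigl(\frac{1}{n}\sum_{i=1}^{n} f \circ Z_i \,\Bigm|\, \mathcal{F}_{n+1}\Bigr) = \frac{1}{n+1}\sum_{i=1}^{n+1} f \circ Z_i. \tag{10}$$

Then $\mathcal{Z}$ satisfies the SLLN and $\mu_{Z_1}$ is the asymptotic mean of $(\mathcal{Z}, \mu)$.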



2.2.2 Ergodic processes 

In this section we recall the basic notions and results for dynamical systems. To this end let $Z$ be
a measurable space and $S : Z^{\mathbb{N}} \to Z^{\mathbb{N}}$ be the shift operator defined by $(z_i)_{i \ge 1} \mapsto (z_{i+1})_{i \ge 1}$. A set $B \subset Z^{\mathbb{N}}$
is called invariant if $S^{-1}(B) = B$. Moreover, let $(\Omega, \mathcal{A}, \mu)$ be a probability space and $\mathcal{Z} := (Z_i)_{i \ge 1}$
be a $Z$-valued stochastic process on $\Omega$. Then $\mathcal{Z}$ is called ergodic if we have $\mu_{\mathcal{Z}}(B) \in \{0, 1\}$ for all
measurable invariant subsets $B \subset Z^{\mathbb{N}}$. It is not hard to see that every image of an ergodic process
is again an ergodic process.

In the following we are mainly interested in stationary ergodic processes. To this end let us now
assume that $(Z, \mathcal{B}, \mu)$ is a probability space and $T : Z \to Z$ is a measurable map. Then the stochastic
process $\mathcal{Z} := (T^{i-1})_{i \ge 1}$ is called a dynamical system, and it is called an invariant dynamical system
if the $T$-image $\mu_T$ of $\mu$ satisfies $\mu_T = \mu$. Recall that an invariant dynamical system $\mathcal{Z} := (T^{i-1})_{i \ge 1}$
on a probability space $(Z, \mathcal{B}, \mu)$ is ergodic if and only if $\mu$ satisfies $\mu(B) \in \{0, 1\}$ for all measurable
$B \subset Z$ with $T^{-1}(B) = B$. Moreover, recall that every stationary process is the image of a hidden
invariant dynamical system. Conversely, every invariant dynamical system is stationary and hence
AMS. In addition, recall Birkhoff's theorem (see e.g. p. 82ff]):

Theorem 2.10 Let $\mathcal{Z} := (T^{i-1})_{i \ge 1}$ be an invariant dynamical system on a probability space
$(Z, \mathcal{B}, \mu)$. Then the following statements are equivalent:

i) $\mathcal{Z}$ satisfies the SLLNE.

ii) $\mathcal{Z}$ satisfies the SLLN.

iii) $\mathcal{Z}$ is ergodic.

With the help of the above theorem one can show (see e.g. \12\ p. 26f]) that every stationary
ergodic process $\mathcal{Z}$ satisfies the SLLN. Moreover, by a theorem by Gray and Kieffer (see e.g. [22,
p. 33]) we know that a dynamical system $\mathcal{Z} := (T^{i-1})_{i \ge 1}$ is AMS if and only if $\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} f \circ
T^{i-1}$ exists $\mu$-almost surely for all $f \in \mathcal{L}_\infty(Z)$. Note that Birkhoff's theorem shows that the
corresponding limit is a constant function if and only if the dynamical system is ergodic. Finally, it
is interesting to note that for stationary, ergodic processes the limit relation (9) holds (see e.g. [23,
Thm. 2.19, p. 61]).

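Let us now recall a notion related to ergodicity. To this end let $(Z, \mathcal{B}, \mu)$ be a probability space
and $\mathcal{Z} := (T^{i-1})_{i \ge 1}$ be an invariant dynamical system on $Z$. Then $\mathcal{Z}$ is said to be weakly mixing if

$$\lim_{n \to \infty} \frac{1}{n}\sum_{i=0}^{n-1} \bigl|\mu\bigl(T^{-i}(A) \cap B\bigr) - \mu(A)\mu(B)\bigr| = 0, \qquad A, B \in \mathcal{B}.$$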



It is well known that weak mixing implies ergodicity, and that the converse implication does
not hold in general (see e.g. [24, p. 41ff]). Moreover, one can also introduce mixing conditions
for general stationary ergodic processes. For example, if $(\Omega, \mathcal{A}, \mu)$ is a probability space, $Z$ is a
measurable space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ is a $Z$-valued stochastic process on $\Omega$ then $\mathcal{Z}$ is called mixing
if

$$\lim_{n \to \infty} \mu_{\mathcal{Z}}\bigl(S^{-n}(A) \cap B\bigr) = \mu_{\mathcal{Z}}(A)\,\mu_{\mathcal{Z}}(B) \tag{11}$$

holds for all measurable $A, B \subset Z^{\mathbb{N}}$. One can show (see e.g. [23, Prop. 2.8, p. 50]) that for invariant
dynamical systems this definition coincides with the above mixing definition. Moreover, recall that
i.i.d. processes are invariant and weakly mixing (see [24, p. 58]).

The weak mixing is important because it allows us to establish the ergodicity of products of 
dynamical systems. This leads to our last example: 

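Proposition 2.11 Let $\mu$ be a probability measure on $\mathbb{R}^d$ and $\mathcal{Z}$ be an invariant ergodic dynamical
system on $(\mathbb{R}^d, \mu)$. Furthermore, let $(\Omega, \mathcal{A}, \nu)$ be a probability space and $\mathcal{E} := (\varepsilon_i)_{i \ge 1}$ be an i.i.d. sequence of
random variables $\varepsilon_i : \Omega \to \mathbb{R}^d$. Then the process $\mathcal{Z} + \mathcal{E}$ defined on $(\mathbb{R}^d \times \Omega, \mu \otimes \nu)$ satisfies the
SLLN.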
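
A minimal numerical sketch of this setting for $d = 1$ is given below. Everything concrete in it is an assumption made for illustration: the map $T(z) = (z + \theta) \bmod 1$ with the irrational angle $\theta = \sqrt{2} - 1$ (a standard example of an ergodic system that leaves the uniform distribution on $[0,1)$ invariant), the starting point, the Gaussian noise, and the function name `noisy_rotation_average`. The time average of the identity function should then approach the mean $1/2$ of the uniform distribution plus the zero mean of the noise.

```python
import math
import random


def noisy_rotation_average(n, z0=0.3, noise_std=0.1, seed=0):
    """Time average (1/n) * sum_{i=1}^n (T^{i-1}(z0) + eps_i).

    Hypothetical example: T is the rotation z -> (z + theta) mod 1 with an
    irrational angle theta, and the eps_i are i.i.d. centered Gaussian noise.
    """
    theta = math.sqrt(2.0) - 1.0  # irrational rotation angle (assumption)
    rng = random.Random(seed)
    z = z0  # T^0(z0), the first observation of the hidden system
    total = 0.0
    for _ in range(n):
        total += z + rng.gauss(0.0, noise_std)
        z = (z + theta) % 1.0  # advance the dynamical system by one step
    return total / n
```

Note that the noisy observations are unbounded, so this example corresponds to the SLLN for integrable functions rather than to the SLLNE.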

2.2.3 Markov chains 

In this subsection we briefly discuss a law of large numbers for Markov chains. To this end let
us fix a probability space $(Z, \mathcal{B}, \nu)$. Furthermore, let $p : \mathcal{B} \times Z \to [0, 1]$ be a stochastic transition
function, i.e. a Markov kernel. Let us define a probability measure $P$ on $(Z^{\mathbb{N}}, \mathcal{B}^{\mathbb{N}})$ by

$$P(B_1 \times \dots \times B_n) := \int \mathbf{1}_{B_1 \times \dots \times B_n}(z_1, \dots, z_n)\, p(dz_n, z_{n-1}) \cdots p(dz_2, z_1)\, \nu(dz_1), \tag{12}$$

where $n$ runs over all positive integers and $B_1, \dots, B_n$ run over all measurable subsets of $Z$. A $Z$-valued
stochastic process $\mathcal{Z}$ defined on a probability space $(\Omega, \mathcal{A}, \mu)$ is called a homogeneous^2 Markov chain
with transition function $p$ and initial distribution $\nu$ if it satisfies $\mu_{\mathcal{Z}} = P$, where $P$ is determined
by (12). Obviously, the sequence $(\pi_i)_{i \ge 1}$ of coordinate projections $\pi_i : Z^{\mathbb{N}} \to Z$, $(z_j)_{j \ge 1} \mapsto z_i$, is a
canonical model of such a Markov chain if $Z^{\mathbb{N}}$ is equipped with the distribution $P$. Moreover, if
the homogeneous Markov chain is stationary then $\nu$ satisfies $\mu_{Z_i} = \nu$ for all $i \ge 1$.

The transition function describes the probability of $Z_{n+1}$ given the state of the process at time
$n$. For larger steps ahead one can iteratively compute the corresponding transition probabilities by

$$p^{(1)}(B, z) := p(B, z), \qquad p^{(n+1)}(B, z) := \int p^{(n)}(B, z')\, p(dz', z).$$
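
For a finite state space the recursion above reduces to matrix multiplication. The following sketch assumes, purely for illustration, that the states are labelled $0, \dots, m-1$ and that the kernel is stored as a row-stochastic matrix `M` with `M[z][w]` equal to $p(\{w\}, z)$; the function names are hypothetical.

```python
def n_step_matrix(M, n):
    """Return the matrix P with P[z][b] = p^(n)({b}, z) for n >= 1.

    M[z][w] is the one-step probability p({w}, z) of moving from z to w.
    """
    size = len(M)
    P = [row[:] for row in M]  # p^(1) = p
    for _ in range(n - 1):
        # p^(k+1)({b}, z) = sum_w p^(k)({b}, w) * p({w}, z)
        P = [[sum(M[z][w] * P[w][b] for w in range(size))
              for b in range(size)]
             for z in range(size)]
    return P


def n_step_probability(M, n, B, z):
    """Return p^(n)(B, z) for a set B of state labels."""
    P = n_step_matrix(M, n)
    return sum(P[z][b] for b in B)
```

For example, with `M = [[0.9, 0.1], [0.5, 0.5]]` the two-step probability of being in state 0 when starting in state 0 is `n_step_probability(M, 2, {0}, 0)`, which equals $0.9 \cdot 0.9 + 0.1 \cdot 0.5$ up to rounding.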

Let us now assume that there exists a finite measure $Q$ on $\mathcal{B}$ with $Q(Z) > 0$, an integer $n \ge 1$, and
a real number $\varepsilon > 0$ such that for all measurable $B \subset Z$ we have

$$Q(B) \le \varepsilon \quad \Longrightarrow \quad p^{(n)}(B, z) \le 1 - \varepsilon \quad \text{for all } z \in Z. \tag{13}$$

This assumption taken from [25, p. 192] is often called the "Doeblin condition" (see e.g. [25, p. 197]
or [26, p. 156]). If $Z$ is a finite set, then (13) is automatically satisfied (see e.g. [25, p. 192]).
Moreover, if $Z \subset \mathbb{R}^d$ is a set of finite Lebesgue measure and the distributions $p(\cdot, z)$, $z \in Z$, are
absolutely continuous with uniformly bounded transition densities then (13) also holds (see e.g. [25,
p. 193]). For some similar conditions we finally refer to [26] and the references therein.

Now, the following theorem which can be found in [25, p. 219] gives a simple condition ensuring
a SLLN for Markov chains:

Theorem 2.12 Let $(Z, \mathcal{B}, \nu)$ be a probability space, $p : \mathcal{B} \times Z \to [0, 1]$ be a stochastic transition
function and $\mathcal{Z} = (Z_i)_{i \ge 1}$ be a stationary homogeneous Markov chain with transition function $p$
and initial distribution $\nu$. If $\mathcal{Z}$ satisfies (13) then $\mathcal{Z}$ satisfies the SLLN.

The above theorem can be generalized to non-homogeneous, not identically distributed Markov
chains. Since these generalizations are out of the scope of the paper we refer to [19, p. 129-135]
for details. Finally, we would also like to mention without explaining the details that if $Z$ is
a countable set then an irreducible, positive recurrent, homogeneous Markov chain satisfies the
SLLNE (see e.g. [27, Thm. 1.10.2]).
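
The last statement can be checked numerically on a small example. The sketch below is an illustration under our own assumptions: a hypothetical two-state chain with transition matrix `M` (rows are current states), a deterministic start in state 0, and the function name `state_frequency`. For `M = [[0.9, 0.1], [0.5, 0.5]]` the chain is irreducible, and solving $\pi M = \pi$ by hand gives the stationary distribution $\pi = (5/6, 1/6)$, so the relative frequency of state 0 should approach $5/6$ even though the chain is not started in its stationary distribution.

```python
import random


def state_frequency(M, n, start=0, target=0, seed=0):
    """Relative frequency of visits to `target` among the first n states.

    M[z][w] is the probability of moving from state z to state w; the chain
    is started deterministically in `start` (so it is not stationary).
    """
    rng = random.Random(seed)
    state = start
    visits = 0
    for _ in range(n):
        if state == target:
            visits += 1
        # Sample the next state from the row M[state] by inverting the CDF.
        u = rng.random()
        cumulative = 0.0
        for nxt, prob in enumerate(M[state]):
            cumulative += prob
            if u < cumulative:
                state = nxt
                break
        else:
            state = len(M) - 1  # guard against rounding in the row sum
    return visits / n
```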

2.3 Loss functions, Risks, and Consistency 

In this section we recall some basic notions for loss functions and their associated risks. We then 
introduce consistency notions for learning algorithms for stochastic processes satisfying a law of 
large numbers. 

In the following $X$ is always a measurable space if not mentioned otherwise and $Y \subset \mathbb{R}$ is always a
closed subset. Moreover, metric spaces are always equipped with the Borel $\sigma$-algebra, and products
of measurable spaces are always equipped with the corresponding product $\sigma$-algebra. Finally, $L_p(\mu)$
stands for the standard space of $p$-integrable functions with respect to the measure $\mu$ on $X$.

Definition 2.13 A function $L : X \times Y \times \mathbb{R} \to [0, \infty]$ is called a loss function if it is measurable.
In this case $L$ is called:

i) convex if $L(x, y, \cdot) : \mathbb{R} \to [0, \infty]$ is convex for all $x \in X$, $y \in Y$.

ii) continuous if $L(x, y, \cdot) : \mathbb{R} \to [0, \infty]$ is continuous for all $x \in X$, $y \in Y$.

Moreover, for a probability measure $P$ on $X \times Y$ and an $f \in \mathcal{L}_0(X)$ the $L$-risk of $f$ is defined by

$$\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x))\, dP(x, y) = \int_X \int_Y L(x, y, f(x))\, dP(y|x)\, dP_X(x).$$

Finally, the Bayes $L$-risk is $\mathcal{R}^*_{L,P} := \inf\{\mathcal{R}_{L,P}(f) : f \in \mathcal{L}_0(X)\}$.

^2 Since we only deal with homogeneous Markov chains we often omit the adjective "homogeneous".

Note that the integral defining the L-risk always exists since L is non-negative and measurable. 
In addition it is obvious that the risk of a convex loss is convex on $\mathcal{L}_0(X)$. However, in general
the risk of a continuous loss is not continuous. In order to ensure this continuity and several other, 
more sophisticated properties we need the following definition: 

Definition 2.14 We call a loss function $L : X \times Y \times \mathbb{R} \to [0, \infty]$ a Nemitski loss function if there
exist a measurable function $b : X \times Y \to [0, \infty)$ and an increasing function $h : [0, \infty) \to [0, \infty)$ with

$$L(x, y, t) \le b(x, y) + h(|t|), \qquad (x, y, t) \in X \times Y \times \mathbb{R}. \tag{14}$$

Furthermore, we say that $L$ is a Nemitski loss of order $p \in (0, \infty)$ if there exists a constant $c > 0$
with $h(t) = c\,t^p$ for all $t \ge 0$. Finally, if $P$ is a distribution on $X \times Y$ with $b \in L_1(P)$ we say that
$L$ is a $P$-integrable Nemitski loss.

Note that $P$-integrable Nemitski loss functions $L$ satisfy $\mathcal{R}_{L,P}(f) < \infty$ for all $f \in L_\infty(P_X)$, and
consequently we also have $\mathcal{R}_{L,P}(0) < \infty$ and $\mathcal{R}^*_{L,P} < \infty$.

For our further investigations we also need the following additional properties which are satisfied 
by basically all commonly used loss functions: 

Definition 2.15 Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function. We say that $L$ is:

i) locally bounded if for all bounded $A \subset \mathbb{R}$ the restriction $L|_{X \times Y \times A}$ of $L$ is a bounded function.

ii) locally Lipschitz continuous if for all $a > 0$ we have

$$|L|_{a,1} := \sup_{\substack{t, t' \in [-a, a] \\ t \ne t'}} \; \sup_{x \in X,\, y \in Y} \frac{|L(x, y, t) - L(x, y, t')|}{|t - t'|} < \infty. \tag{15}$$

iii) Lipschitz continuous if we have $|L|_1 := \sup_{a > 0} |L|_{a,1} < \infty$.

Note that if $Y \subset \mathbb{R}$ is a finite subset and $L : Y \times \mathbb{R} \to [0, \infty)$ is a convex loss function then $L$ is
a locally Lipschitz continuous loss function. Moreover, a locally Lipschitz continuous loss function
$L$ is a Nemitski loss since (15) yields

$$L(x, y, t) \le L(x, y, 0) + |L|_{|t|,1}\, |t|, \qquad (x, y, t) \in X \times Y \times \mathbb{R}. \tag{16}$$

In particular, a locally Lipschitz continuous loss $L$ is a $P$-integrable Nemitski loss if and only if
$\mathcal{R}_{L,P}(0) < \infty$. Moreover, if $L$ is Lipschitz continuous then $L$ is a Nemitski loss of order 1.

The following examples recall that (locally) Lipschitz continuous losses are often used in learning 
algorithms for classification and regression problems: 
Example 2.16 A loss $L : Y \times \mathbb{R} \to [0, \infty)$ of the form $L(y, t) = \varphi(yt)$ for a suitable function $\varphi : \mathbb{R} \to \mathbb{R}$
and all $y \in Y := \{-1, 1\}$ and $t \in \mathbb{R}$ is called margin-based. Recall that margin-based losses such as
the (squared) hinge loss, the AdaBoost loss, the logistic loss and the least squares loss are used in many
classification algorithms. Obviously, $L$ is convex, continuous, or (locally) Lipschitz continuous if and only if
$\varphi$ is. In addition, convexity of $L$ implies local Lipschitz continuity of $L$. Moreover, $L$ is always a $P$-integrable
Nemitski loss since we have

$$L(y, t) \le \max\{\varphi(-t), \varphi(t)\} \tag{17}$$

for all $y \in Y$ and all $t \in \mathbb{R}$. In particular, this estimate shows that every convex margin-based loss is locally
bounded. Moreover, from (17) we can easily derive a characterization for $L$ being a $P$-integrable Nemitski
loss of order $p$.
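
The margin-based losses named in this example can be written down directly. The sketch below uses the usual textbook forms of the representing functions $\varphi$ (an assumption on our part, since the text does not spell them out), and the helper names `margin_loss` and `satisfies_bound_17` are hypothetical.

```python
import math


def hinge(u):
    return max(0.0, 1.0 - u)


def squared_hinge(u):
    return max(0.0, 1.0 - u) ** 2


def logistic(u):
    return math.log1p(math.exp(-u))


def least_squares(u):
    # For y in {-1, 1} we have (y - t)^2 = (1 - y*t)^2.
    return (1.0 - u) ** 2


def margin_loss(phi, y, t):
    """Margin-based loss L(y, t) = phi(y * t) for labels y in {-1, 1}."""
    if y not in (-1, 1):
        raise ValueError("labels must be -1 or +1")
    return phi(y * t)


def satisfies_bound_17(phi, ts):
    """Check L(y, t) <= max(phi(-t), phi(t)) on a finite grid of t values."""
    return all(margin_loss(phi, y, t) <= max(phi(-t), phi(t))
               for y in (-1, 1) for t in ts)
```

Since $yt \in \{-t, t\}$ for $y \in \{-1, 1\}$, the check in `satisfies_bound_17` succeeds for every function `phi`, which is exactly the reasoning behind (17).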

Example 2.17 A loss $L : Y \times \mathbb{R} \to [0, \infty)$ of the form $L(y, t) = \psi(y - t)$ for a suitable function $\psi : \mathbb{R} \to \mathbb{R}$
and all $y \in Y := \mathbb{R}$ and $t \in \mathbb{R}$ is called distance-based. Distance-based losses such as the least squares loss,
Huber's insensitive loss, the logistic loss, or the $\varepsilon$-insensitive loss are usually used for regression. It is easy
to see that $L$ is convex, continuous, or Lipschitz continuous if and only if $\psi$ is. Let us say that $L$ is of upper
growth $p \in [1, \infty)$ if there is a $c > 0$ with

$$\psi(r) \le c\,(|r|^p + 1), \qquad r \in \mathbb{R}.$$

Analogously, $L$ is said to be of lower growth $p \in [1, \infty)$ if there is a $c > 0$ with

$$\psi(r) \ge c\,(|r|^p - 1), \qquad r \in \mathbb{R}.$$

Recall that most of the commonly used distance-based loss functions including the above examples are of
the same upper and lower growth type. Then it is obvious that $L$ is of upper growth type 1 if it is Lipschitz
continuous, and if $L$ is convex the converse implication also holds. Moreover, non-trivial convex $L$ are always
of lower growth type 1. In addition, a distance-based loss function of upper growth type $p \in [1, \infty)$ is a
Nemitski loss of order $p$, and if the distribution $P$ satisfies the moment condition

$$|P|_p := \bigl(\mathbb{E}_{(x,y) \sim P} |y|^p\bigr)^{1/p} = \Bigl(\int_{X \times Y} |y|^p\, dP(x, y)\Bigr)^{1/p} < \infty \tag{18}$$

it is also $P$-integrable.

If our observations are realizations of a sequence $\mathcal{Z}$ of random variables $(X_i, Y_i) : \Omega \to X \times Y$
satisfying a law of large numbers then the following lemma proved in Section 4 shows that the risk
with respect to the asymptotic mean distribution $P$ actually describes the average future loss.

Lemma 2.18 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $X$ be a measurable space, $Y \subset \mathbb{R}$ be a closed
subset, and $\mathcal{Z} := ((X_i, Y_i))_{i \ge 1}$ be an $X \times Y$-valued stochastic process on $\Omega$ satisfying the WLLNE.
Furthermore, let $P$ be the asymptotic mean of $(\mathcal{Z}, \mu)$ and $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function.
If $L$ is locally bounded then for all $f \in \mathcal{L}_\infty(X)$ and all $n_0 \ge 0$ we have

$$\mathcal{R}_{L,P}(f) = \lim_{n \to \infty} \frac{1}{n - n_0}\sum_{i=n_0+1}^{n} L(X_i, Y_i, f(X_i)), \tag{19}$$

where the limit is with respect to the convergence in probability $\mu$. Moreover, if $\mathcal{Z}$ actually satisfies
the SLLNE then (19) holds $\mu$-almost surely. Finally, the same conclusions hold if $L$ is a $P$-integrable
Nemitski loss and $\mathcal{Z}$ satisfies the WLLN or SLLN.

With the help of the above lemma we can now introduce some reasonable concepts describing 
the asymptotic learning ability of learning algorithms. To this end recall that a method $\mathcal{L}$ that
provides to every training set $T := ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ a (measurable) function
$f_T : X \to \mathbb{R}$ is called a learning method. The following definition introduces an asymptotic way to
describe whether a learning method can learn from samples: 
Definition 2.19 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $X$ be a measurable space, $Y \subset \mathbb{R}$ be a closed
subset, and $\mathcal{Z} := ((X_i, Y_i))_{i \ge 1}$ be an $X \times Y$-valued stochastic process on $\Omega$ satisfying the WLLNE.
Furthermore, let $P$ be the asymptotic mean of $(\mathcal{Z}, \mu)$ and $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function.
We say that a learning method $\mathcal{L}$ is $L$-consistent for $\mathcal{Z}$ if

$$\lim_{n \to \infty} \mathcal{R}_{L,P}(f_{T_n}) = \mathcal{R}^*_{L,P} \tag{20}$$

holds in probability $\mu$, where $T_n := ((X_1, Y_1), \dots, (X_n, Y_n))$ and $\mathcal{R}^*_{L,P}$ is the Bayes risk defined in
Definition 2.13. Moreover, we say that $\mathcal{L}$ is strongly $L$-consistent for $\mathcal{Z}$ if (20) holds $\mu$-almost
surely.

2.4 Consistency of SVMs 

In this subsection we present some results showing that support vector machines (SVMs) can learn 
whenever the data-generating process satisfies a law of large numbers. 

Let us begin by recalling the definition of SVMs. To this end let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a
convex loss function and $H$ be a reproducing kernel Hilbert space (RKHS) over $X$ (see e.g. [28]).
Then for all $\lambda > 0$ and all observations $T := ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ there exists exactly
one element $f_{T,\lambda} \in H$ with

$$f_{T,\lambda} \in \arg\min_{f \in H} \; \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)). \tag{21}$$

Given a null-sequence $(\lambda_n)$ of strictly positive real numbers we call the learning method which
provides to every training set $T \in (X \times Y)^n$ the decision function $f_{T,\lambda_n}$ an $(\lambda_n)$-SVM based on $H$
and $L$. For more information on SVMs we refer to [29, 30].
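To make the optimization problem (21) concrete, the following minimal sketch solves it in one special case. It assumes the least squares loss and a Gaussian kernel on the real line, for which the representer theorem reduces (21) to the linear system $(K + n\lambda I)c = y$ with Gram matrix $K$. The function names and the toy data are illustrative assumptions, not part of the paper.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    # Gaussian RBF kernel in the convention exp(-sigma^2 * |x - z|^2), scalar inputs.
    return math.exp(-(sigma ** 2) * (x - z) ** 2)


def solve_linear_system(matrix, rhs):
    # Gaussian elimination with partial pivoting on the augmented matrix.
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_least_squares_svm(xs, ys, lam, sigma=1.0):
    # Representer-theorem ansatz f = sum_j coeffs[j] * k(x_j, .);
    # for the least squares loss the coefficients solve (K + n*lam*I) c = y.
    n = len(xs)
    gram = [[gaussian_kernel(a, b, sigma) for b in xs] for a in xs]
    system = [[gram[i][j] + (n * lam if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve_linear_system(system, ys), gram


def regularized_risk(coeffs, gram, ys, lam):
    # lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2 for f given by coeffs.
    n = len(ys)
    preds = [sum(gram[i][j] * coeffs[j] for j in range(n)) for i in range(n)]
    norm_sq = sum(coeffs[i] * preds[i] for i in range(n))
    return lam * norm_sq + sum((ys[i] - preds[i]) ** 2 for i in range(n)) / n


# Toy data (hypothetical) and the resulting minimal value of the objective.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.1, 0.4, 0.9, 0.7, 0.2]
coeffs, gram = fit_least_squares_svm(xs, ys, lam=0.1)
best = regularized_risk(coeffs, gram, ys, lam=0.1)
```

Perturbing any single coefficient should not decrease `regularized_risk`, which is a quick sanity check of the uniqueness statement for this particular loss. Other convex losses (for example the hinge loss) have no closed form and need a numerical solver.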
Moreover, given a distribution $P$ on $X \times Y$ we say that the RKHS $H$ is $(L, P)$-rich if we have

$$\mathcal{R}^*_{L,P,H} := \inf_{f \in H} \mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P},$$

i.e. if the Bayes risk can be approximated by functions from $H$. Note that the condition $\mathcal{R}^*_{L,P,H} =
\mathcal{R}^*_{L,P}$ is satisfied (see [31]) whenever the kernel of $H$ is universal in the sense of [32], i.e. $X$ is a
compact metric space and $H$ is dense in the space $C(X)$ of continuous functions. Less restrictive
assumptions on $H$ and $X$ have been recently found in [31]. In particular, it was shown in [31] that
the RKHSs $H_\sigma$, $\sigma > 0$, of the Gaussian RBF kernels

$$k_\sigma(x, x') := \exp\bigl(-\sigma^2 \|x - x'\|_2^2\bigr), \qquad x, x' \in \mathbb{R}^d,$$

are $(L, P)$-rich for all distributions $P$ on $\mathbb{R}^d \times Y$ and all continuous, $P$-integrable Nemitski losses
$L$ of order $p \in [1, \infty)$. Finally, one can also find some necessary and sufficient conditions for
$(L, P)$-richness on countable spaces $X$ in [31].
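As a small illustration of the Gaussian RBF kernel in the convention used above, where $\sigma$ multiplies the squared distance and is therefore an inverse width, the sketch below evaluates the kernel and a quadratic form of its Gram matrix. The random points and weights are hypothetical test data. The non-negativity of the quadratic form is the positive semi-definiteness that makes $k_\sigma$ a kernel.

```python
import math
import random


def gaussian_rbf(x, z, sigma):
    # k_sigma(x, z) = exp(-sigma^2 * ||x - z||_2^2) for vectors given as lists.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-(sigma ** 2) * sq_dist)


def gram_quadratic_form(points, weights, sigma):
    # sum_{i,j} w_i * w_j * k_sigma(x_i, x_j); non-negative for a kernel.
    return sum(wi * wj * gaussian_rbf(xi, xj, sigma)
               for wi, xi in zip(weights, points)
               for wj, xj in zip(weights, points))


random.seed(0)
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
weights = [random.uniform(-1, 1) for _ in range(6)]
form_value = gram_quadratic_form(points, weights, sigma=2.0)
```

Note that many software packages parametrize the same kernel as $\exp(-\|x - x'\|^2 / (2 s^2))$, which corresponds to $\sigma^2 = 1/(2 s^2)$ here.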

In order to present our first main result let us recall that a Polish space is a topological
space with a countable dense subset whose topology can be described by a complete metric. It is
well known that e.g. closed and open subsets of $\mathbb{R}^d$ and compact metric spaces are Polish. Now our
first theorem shows that for every process satisfying a law of large numbers for events there exists
an SVM which is consistent for this process:

Theorem 2.20 Let $X$ be a Polish space, $Y \subset \mathbb{R}$ be a closed subset and $L : X \times Y \times \mathbb{R} \to [0, \infty)$
be a convex, locally Lipschitz continuous, and locally bounded loss function. Moreover, let $(\Omega, \mathcal{A}, \mu)$
be a probability space, $Z := ((X_i, Y_i))_{i \ge 1}$ be an $X \times Y$-valued stochastic process on $\Omega$ satisfying the
WLLNE, and $P$ be the asymptotic mean of $(Z, \mu)$. Finally, let $H$ be an $(L, P)$-rich RKHS over
$X$ with continuous kernel. Then there exists a null-sequence $(\lambda_n)$ of strictly positive real numbers
such that the $(\lambda_n)$-SVM based on $H$ and $L$ is $L$-consistent for $Z$.

In addition, if $Z$ satisfies the SLLNE then $(\lambda_n)$ can be chosen such that the $(\lambda_n)$-SVM is strongly
$L$-consistent for $Z$.

The next theorem establishes a similar result for distance-based loss functions (see Example 2.17)
which, in general, are not locally bounded.

Theorem 2.21 Let $X$ be a Polish space, $Y \subset \mathbb{R}$ be a closed subset and $L : Y \times \mathbb{R} \to [0, \infty)$ be
a convex, distance-based loss function of upper growth-type $p \in [1, \infty)$. Moreover, let $(\Omega, \mathcal{A}, \mu)$ be
a probability space, $Z := ((X_i, Y_i))_{i \ge 1}$ be an $X \times Y$-valued stochastic process on $\Omega$ satisfying the
WLLN, and $P$ be the asymptotic mean of $(Z, \mu)$. We assume $|P|_p < \infty$. Finally, let $H$ be the
$(L, P)$-rich RKHS of a continuous kernel on $X$. Then there exists a null-sequence $(\lambda_n)$ of strictly
positive real numbers such that the $(\lambda_n)$-SVM based on $H$ and $L$ is $L$-consistent for $Z$.
In addition, if $Z$ satisfies the SLLN then $(\lambda_n)$ can be chosen such that the $(\lambda_n)$-SVM is strongly
$L$-consistent for $Z$.

The techniques used in the proofs of Theorems 2.20 and 2.21 are based on a (hidden) skeleton
argument in the proof of Lemma 4.5. A more general though standard skeleton argument can
be used to derive results similar to Theorems 2.20 and 2.21 for other empirical risk minimization
methods using hypothesis sets with reasonably controllable complexity. Due to space constraints
we omit the details.

Let us now assume for a moment that $X$ is a subset of $\mathbb{R}^d$, $L$ is a loss function in the sense
of either Theorem 2.20 or 2.21, and $H$ is the RKHS of a Gaussian RBF kernel. Then the above
theorems together with the richness results from [31] show that for all data-generating processes
$Z$ satisfying a law of large numbers there exist suitable regularization sequences $(\lambda_n)$ that allow
us to build a consistent SVM. However, the sequences of Theorem 2.20 or 2.21 depend on $Z$, and
consequently, it would be desirable to have either a universal sequence $(\lambda_n)$, i.e. a sequence that
guarantees consistency for all $Z$, or a consistent method that finds suitable values for $\lambda$ from the
observations. Unfortunately, the following theorem due to Nobel [10], together with Birkhoff's
ergodic theorem, shows that neither of these alternatives is possible.^

Theorem 2.22 There is no learning method which is $L_{\mathrm{lsquares}}$-consistent for all stationary ergodic
processes $(X_i, Y_i)$ with values in $[0,1] \times [0,1]$, where $L_{\mathrm{lsquares}}$ denotes the usual least square loss
$L_{\mathrm{lsquares}}(y, t) := (y - t)^2$, $y, t \in \mathbb{R}$. Moreover, there is no learning method which is $L_{\mathrm{class}}$-consistent
for all stationary ergodic processes $(X_i, Y_i)$ with values in $[0, 1] \times \{-1, 1\}$, where $L_{\mathrm{class}}$ denotes the
classification loss $L_{\mathrm{class}}(y, t) := \mathbf{1}_{(-\infty, 0]}(y \, \mathrm{sign}\, t)$, $y = \pm 1$, $t \in \mathbb{R}$.

Roughly speaking, the impossibility of finding a universal sequence $(\lambda_n)$ is related to the fact
that there is no uniform convergence speed in the LLNs for general processes. More precisely, if
$Z := ((X_i, Y_i))_{i \ge 1}$ is a stochastic process which satisfies a law of large numbers then for all $\varepsilon > 0$,
$n \ge 1$, and all suitable functions $f : X \times Y \to \mathbb{R}$ there exists a $\delta(\varepsilon, f, n) > 0$ with

$$\mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n} \sum_{i=1}^{n} f \circ Z_i(\omega) - \mathbb{E}_P f\Bigr| > \varepsilon\Bigr\}\Bigr) \le \delta(\varepsilon, f, n) \tag{22}$$

and $\lim_{n \to \infty} \delta(\varepsilon, f, n) = 0$. Now, the proofs of Theorem 2.20 and Theorem 2.21 (essentially) show
that we can determine a sequence $(\lambda_n)$ whenever we know such $\delta(\varepsilon, f, n)$ for all $\varepsilon > 0$, $n \ge 1$, and
a suitably large class of functions $f$. However, since there exists no universal sequence $(\lambda_n)$ by
Theorem 2.22 we consequently see that there exist no values $\delta(\varepsilon, f, n)$ such that (22) holds for all
(stationary) processes satisfying a law of large numbers.

^Recall that binary classification is the "easiest" non-parametric learning problem in the sense that negative results
for this learning problem can typically be translated into negative results for almost all learning problems defined by
loss functions (cf. p. 118f in [33] for some examples in this direction and the proof of the below theorem in [10] for the
least squares loss).

This discussion shows that in order to build consistent SVMs for interesting classes of processes 
one has to find quantitative versions of laws of large numbers. For i.i.d. processes such laws have 
been established in recent years by several authors. In the following section we will present a 
simple yet powerful method for establishing quantitative versions of laws of large numbers for 
mixing processes. 



3 Consistency for Mixing Processes 

In this section we derive consistency results for SVMs under the assumption that the data-generating 
process satisfies certain mixing conditions. These mixing conditions generally quantify how much 
a process fails to be independent. In the first subsection we recall some commonly used mixing 
conditions. In the second subsection we then present our consistency results and compare them 
with known consistency results for other learning algorithms. 



3.1 A Brief Introduction to Mixing Coefficients for Processes 



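In this subsection we recall some standard mixing coefficients and their basic properties (see e.g. [23]
and [12] for thorough treatment). To this end let $\Omega$ be a set, $\mathcal{A}$ and $\mathcal{B}$ be two $\sigma$-algebras on $\Omega$, and
$\mu$ be a probability measure on $\sigma(\mathcal{A} \cup \mathcal{B})$. Furthermore, let $H$ be a Hilbert space and $\mathcal{L}_p(\mathcal{A}, \mu, H)$
be the space of all $\mathcal{A}$-measurable $H$-valued functions that are $p$-integrable with respect to $\mu$. Using
the convention $0/0 := 0$ we define the following mixing coefficients for the pair $(\mathcal{A}, \mathcal{B})$:

$$\alpha(\mathcal{A}, \mathcal{B}, \mu) := \sup_{A \in \mathcal{A},\, B \in \mathcal{B}} \bigl|\mu(A \cap B) - \mu(A)\mu(B)\bigr|$$

$$\beta(\mathcal{A}, \mathcal{B}, \mu) := \frac{1}{2} \sup\Bigl\{ \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \bigl|\mu(A_i \cap B_j) - \mu(A_i)\mu(B_j)\bigr| : (A_i) \subset \mathcal{A} \text{ and } (B_j) \subset \mathcal{B} \text{ partitions} \Bigr\}$$

$$\varphi(\mathcal{A}, \mathcal{B}, \mu) := \sup_{A \in \mathcal{A},\, B \in \mathcal{B}} \frac{\bigl|\mu(A \cap B) - \mu(A)\mu(B)\bigr|}{\mu(A)}$$

$$\varphi_{\mathrm{sym}}(\mathcal{A}, \mathcal{B}, \mu) := \sqrt{\varphi(\mathcal{A}, \mathcal{B}, \mu)\, \varphi(\mathcal{B}, \mathcal{A}, \mu)}$$

$$R_p^H(\mathcal{A}, \mathcal{B}, \mu) := \sup_{f \in \mathcal{L}_p(\mathcal{A}, \mu, H),\; g \in \mathcal{L}_p(\mathcal{B}, \mu, H)} \frac{\bigl|\mathbb{E}_\mu \langle f, g \rangle - \langle \mathbb{E}_\mu f, \mathbb{E}_\mu g \rangle\bigr|}{\|f\|_p \, \|g\|_p}, \qquad p \in [2, \infty].$$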



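It is obvious from the definitions that all mixing coefficients equal 0 if $\mathcal{A}$ and $\mathcal{B}$ are independent.
Furthermore, besides $\varphi$ they are all symmetric in $\mathcal{A}$ and $\mathcal{B}$. Moreover, we have $\alpha(\mathcal{A}, \mathcal{B}, \mu) \in [0, 1/4]$
and $\beta(\mathcal{A}, \mathcal{B}, \mu), \varphi(\mathcal{A}, \mathcal{B}, \mu), \varphi_{\mathrm{sym}}(\mathcal{A}, \mathcal{B}, \mu), R_p^H(\mathcal{A}, \mathcal{B}, \mu) \in [0, 1]$. In addition, they satisfy the relations
(see e.g. [23, Section 1] and the references therein):

$$2\alpha(\mathcal{A}, \mathcal{B}, \mu) \le \beta(\mathcal{A}, \mathcal{B}, \mu) \le \varphi(\mathcal{A}, \mathcal{B}, \mu)$$

$$4\alpha(\mathcal{A}, \mathcal{B}, \mu) \le R_p^{\mathbb{R}}(\mathcal{A}, \mathcal{B}, \mu) \le 2\varphi_{\mathrm{sym}}(\mathcal{A}, \mathcal{B}, \mu), \qquad p \in [2, \infty].$$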
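For two $\sigma$-algebras generated by finitely-valued random variables, the suprema in the definitions of $\alpha$, $\beta$ and $\varphi$ run over finitely many events and can be evaluated by brute force. The following sketch does this for a joint probability table `joint[i][j]`, which is a hypothetical input format chosen here for illustration. For $\beta$ it uses that the supremum over partitions is attained at the partition into atoms.

```python
from itertools import combinations


def nonempty_subsets(n):
    # All non-empty subsets of {0, ..., n-1}; the empty set contributes 0.
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            yield subset


def mixing_coefficients(joint):
    # joint[i][j] = P(U = i, V = j); returns (alpha, beta, phi) for the pair
    # (sigma(U), sigma(V)) with the convention 0/0 := 0.
    rows, cols = len(joint), len(joint[0])
    pu = [sum(joint[i]) for i in range(rows)]
    pv = [sum(joint[i][j] for i in range(rows)) for j in range(cols)]
    alpha = phi = 0.0
    for a in nonempty_subsets(rows):
        p_a = sum(pu[i] for i in a)
        for b in nonempty_subsets(cols):
            p_b = sum(pv[j] for j in b)
            p_ab = sum(joint[i][j] for i in a for j in b)
            gap = abs(p_ab - p_a * p_b)
            alpha = max(alpha, gap)
            if p_a > 0:
                phi = max(phi, gap / p_a)
    beta = 0.5 * sum(abs(joint[i][j] - pu[i] * pv[j])
                     for i in range(rows) for j in range(cols))
    return alpha, beta, phi


# Two identical fair coins: the extreme case alpha = 1/4.
alpha_eq, beta_eq, phi_eq = mixing_coefficients([[0.5, 0.0], [0.0, 0.5]])
# A dependent 3 x 3 table (hypothetical numbers).
alpha_d, beta_d, phi_d = mixing_coefficients(
    [[0.2, 0.1, 0.0], [0.1, 0.2, 0.1], [0.0, 0.1, 0.2]])
```

On both tables the computed values respect $2\alpha \le \beta \le \varphi$, and an independent table yields zero for all three coefficients. The enumeration is exponential in the number of atoms, so it is only meant for small illustrative examples.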



Moreover, the coefficients $R_p^H(\mathcal{A}, \mathcal{B}, \mu)$ are essentially equivalent to the coefficients $R_p^{\mathbb{R}}(\mathcal{A}, \mathcal{B}, \mu)$ for
the scalar case since [34, Thm. 4.1] shows that for all $p \in [2, \infty]$ there exists a constant $C_p > 0$ such
that for all Hilbert spaces $H$ we have

$$R_p^{\mathbb{R}}(\mathcal{A}, \mathcal{B}, \mu) \le R_p^H(\mathcal{A}, \mathcal{B}, \mu) \le C_p R_p^{\mathbb{R}}(\mathcal{A}, \mathcal{B}, \mu). \tag{23}$$

Note that for $p = 2$ we actually have $C_p = 1$ and for $p = \infty$ we may choose the famous Grothendieck
constant (see the proof of Lemma 2.2 in [35]). Moreover, it is obvious from the definition that
$R_p^H(\mathcal{A}, \mathcal{B}, \mu)$ is decreasing in $p$, i.e.

$$R_p^H(\mathcal{A}, \mathcal{B}, \mu) \le R_q^H(\mathcal{A}, \mathcal{B}, \mu), \qquad q \le p.$$

In particular this yields $R_\infty^H(\mathcal{A}, \mathcal{B}, \mu) \le R_p^H(\mathcal{A}, \mathcal{B}, \mu) \le R_2^H(\mathcal{A}, \mathcal{B}, \mu)$ for all $p \in [2, \infty]$. Finally,
Theorem 4.13 in [36] gives the highly non-trivial relation

$$R_p^{\mathbb{R}}(\mathcal{A}, \mathcal{B}, \mu) \le 2\pi \, \alpha^{1 - \frac{2}{p}}(\mathcal{A}, \mathcal{B}, \mu) \, \varphi_{\mathrm{sym}}^{\frac{2}{p}}(\mathcal{A}, \mathcal{B}, \mu), \qquad p \in [2, \infty]. \tag{24}$$

In view of our consistency results we are mainly interested in the coefficients $R_p^H$. Note that with
the help of the above inequalities these coefficients can be estimated by the typically more accessible
coefficients $\alpha$ and $\varphi$. The coefficient $\beta$, which can often (see [36, Prop. 3.22] for an exact statement)
be computed by

$$\beta(\mathcal{A}, \mathcal{B}, \mu) = \mathbb{E}_\mu \sup_{B \in \mathcal{B}} \bigl|\mu(B) - \mu(B \mid \mathcal{A})\bigr|,$$

is mainly mentioned because it was used in earlier works (see e.g. [11, 37]) on learning from dependent
observations.

Let us now consider mixing coefficients and corresponding mixing notions for stochastic processes:

Definition 3.1 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $\mathcal{Z}$ be a measurable space, and $Z := (Z_i)_{i \ge 1}$ be
a $\mathcal{Z}$-valued stochastic process on $\Omega$. Furthermore, let $\xi$ be one of the above mixing coefficients. For
$i, j \ge 1$ we define the $\xi$-bi-mixing coefficient of $Z$ by

$$\xi(Z, \mu, i, j) := \xi(\sigma(Z_i), \sigma(Z_j), \mu),$$

where $\sigma(Z_i)$ denotes the $\sigma$-algebra generated by $Z_i$. Furthermore, for $n \ge 1$ the $\xi$-mixing and
$\bar{\xi}$-mixing coefficients of $Z$ are defined by

$$\xi(Z, \mu, n) := \sup_{i \ge 1} \xi(Z, \mu, i, i + n)$$

$$\bar{\xi}(Z, \mu, n) := \sup_{i \ge 1} \xi\bigl(\sigma(Z_1, \dots, Z_i), \sigma(Z_{i+n}, Z_{i+1+n}, \dots), \mu\bigr),$$

respectively. In addition, we say that the process $Z$ is:

i) $\xi$-mixing with respect to $\mu$ if the $\xi$-mixing coefficients tend to 0, i.e.

$$\lim_{n \to \infty} \xi(Z, \mu, n) = 0.$$

ii) weakly $\xi$-mixing with respect to $\mu$ if the $\xi$-mixing coefficients tend to 0 on average, i.e.

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \xi(Z, \mu, k) = 0.$$

iii) weakly $\xi$-bi-mixing with respect to $\mu$ if the $\xi$-bi-mixing coefficients tend to 0 on average, i.e.

$$\lim_{n \to \infty} \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \xi(Z, \mu, i, j) = 0. \tag{25}$$

Finally, we define mixing notions analogous to i) and ii) for $\bar{\xi}$.

It is immediately clear that $\xi(Z, \mu, n) \le \bar{\xi}(Z, \mu, n)$, and consequently, every upper bound on
$\bar{\xi}(Z, \mu, n)$ translates into an upper bound on $\xi(Z, \mu, n)$. This trivial observation is interesting since
the literature typically deals with $\bar{\xi}(Z, \mu, n)$, whereas the consistency results which we will present
in the following subsection only require bounds on $\xi(Z, \mu, n)$ or $\xi(Z, \mu, i, j)$. Finally, it is interesting
to note that for stationary, homogeneous Markov chains $Z$ we actually have $\xi(Z, \mu, n) = \bar{\xi}(Z, \mu, n)$
for all $n \ge 1$ and $\xi \ne \varphi_{\mathrm{sym}}$.

Obviously, every $\xi$-mixing process is weakly $\xi$-mixing, and since a simple induction over $n \in \mathbb{N}$
shows

$$\sum_{i=1}^{n} \sum_{j=1}^{i-1} \xi(Z, \mu, i, j) = \sum_{k=1}^{n-1} \sum_{m=1}^{n-k} \xi(Z, \mu, m + k, m), \qquad n \ge 1,$$

we also see that every weakly $\xi$-mixing process is weakly $\xi$-bi-mixing. Moreover, if the process
$Z$ is $\mu$-stationary in the wide sense then an elementary proof (see e.g. [36, Prop. 3.6]) shows
$\xi(Z, \mu, i, j) = \xi(Z, \mu, i + k, j + k)$ for all $i, j, k \ge 1$. Since this implies $\xi(Z, \mu, i, j) = \xi(Z, \mu, i - j + 1)$
for $i > j \ge 1$ we then find

$$\sum_{i=1}^{n} \sum_{j=1}^{i-1} \xi(Z, \mu, i, j) = \sum_{k=1}^{n-1} \sum_{m=1}^{n-k} \xi(Z, \mu, m + k, m) = \sum_{k=1}^{n-1} (n - k)\, \xi(Z, \mu, k + 1) \tag{26}$$

for all $n \ge 1$. Consequently, every stationary weakly $\xi$-bi-mixing process is actually weakly $\xi$-mixing.
Moreover, if the process $Z$ is stationary and mixing in the sense of (11), then [23, Theorem 4.1] shows
that $\beta(Z, \mu, n_0) < 1$ or $\varphi(Z, \mu, n_0) < 1$ for some $n_0 \ge 1$ implies $\beta$-mixing or $\varphi$-mixing, respectively.
Finally, it is discussed on [23, p. 124] that stationary processes $Z$ with $\varphi(Z, \mu, n_0) < 1/2$ for some
$n_0 \ge 1$ are $\varphi$-mixing.
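The collapse of the double sum into a single weighted sum for stationary processes can be checked numerically. The sketch below assumes that the bi-mixing coefficient of the pair $(i, j)$ depends only on the lag $i - j$ and indexes the hypothetical function `lag_coeff` directly by that lag. This indexing is a labeling choice of the sketch, and the decay $k^{-1/2}$ is an arbitrary example.

```python
def double_sum(bi_coeff, n):
    # sum over 1 <= j < i <= n of the bi-mixing coefficient of the pair (i, j).
    return sum(bi_coeff(i, j) for i in range(1, n + 1) for j in range(1, i))


def collapsed_sum(lag_coeff, n):
    # Each lag k = 1, ..., n-1 occurs for exactly n - k pairs (i, j) with i - j = k.
    return sum((n - k) * lag_coeff(k) for k in range(1, n))


def lag_coeff(k):
    # Hypothetical lag-dependent coefficient with polynomial decay.
    return k ** -0.5


def bi_coeff(i, j):
    # Stationarity assumption of the sketch: only the lag |i - j| matters.
    return lag_coeff(abs(i - j))


def weak_bi_mixing_average(n):
    # The normalized quantity appearing in (25).
    return double_sum(bi_coeff, n) / n ** 2
```

Since every weight satisfies $(n - k)/n^2 \le 1/n$, the normalized double sum is bounded by the plain average of the lag coefficients, which is the mechanism behind "weakly mixing implies weakly bi-mixing".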

Examples of $\xi$-mixing, and in particular $\alpha$-mixing processes including certain Markov, ARMA,
MA($\infty$), and GARCH processes can be found in [38, Sect. 2.6.1] and [36, p. 405ff]. Moreover, mixing
properties of Gaussian processes are considered in [36, Chapter 9]. In particular, [36, Theorem
9.5] shows $\alpha(Z, \mu, n) \le R_2^{\mathbb{R}}(Z, \mu, n) \le 2\pi \alpha(Z, \mu, n)$, $n \ge 1$, for stationary Gaussian processes.
Finally, [39, Theorem 26.5] together with [36, Proposition 3.18] shows that for all continuous,
strictly decreasing functions $g : [0, \infty) \to (0, 1/24)$ for which $x \mapsto \log g(x)$ is convex there exists a
stationary process $Z$ with $g(n)/4 \le \alpha(Z, \mu, n) \le \varphi(Z, \mu, n) \le 4 g(n)$ for all $n \ge 1$. Note that this
result in particular shows that in general the $\xi$-mixing rates can be arbitrarily slow. A brief survey
of these and other results together with various references is given in [23].

For Markov chains there are quite a few results on mixing coefficients (see e.g. [23], [36, Chapter
7], and [40, Chapter 21]). Here we only recall the most important ones: [36, Theorem 7.5] (see
also [23, Theorem 3.3]) shows that if a homogeneous Markov chain $Z$ satisfies $R_2^{\mathbb{R}}(Z, \mu, n_0) < 1$ or
$\varphi(Z, \mu, n_0) < 1/2$ for some $n_0 \ge 1$ then $R_2^{\mathbb{R}}(Z, \mu, n)$ or $\varphi(Z, \mu, n)$ tend at least exponentially fast
to 0, and by considering the proof it is also possible to derive explicit bounds for this convergence.
Moreover, if the Markov chain is also stationary, ergodic and aperiodic then $\varphi(Z, \mu, n_0) < 1$ suffices
to obtain exponential $\varphi$-mixing rates. In contrast, for stationary Markov chains there are no similar
results possible for $\beta$-mixing coefficients (see e.g. [40, Theorem 21.3]) or $\alpha$-mixing coefficients (see
e.g. [36, Ex. 7.11]). Because of this lack, previous learning results based on $\beta$-mixing required rather
strong additional assumptions on (stationary) Markov chains such as certain variants of geometric
mixing conditions (see e.g. [11, p. 100ff] and compare with [23, Theorem 3.7] which shows that
such geometric mixing conditions are equivalent to exponentially fast $\beta$-mixing). Moreover, [23,
Theorem 3.4] shows that stationary, ergodic, and aperiodic Markov chains $Z$ with $\alpha(Z, \mu, n_0) < 1/4$
for some $n_0 \ge 1$ are automatically $\alpha$-mixing. Similarly, [23, Corollary 3.6] shows that stationary,
aperiodic Markov chains are $\beta$-mixing if and only if they are irreducible or Harris recurrent. Finally,
stationary Markov processes $Z$ satisfying Doeblin's condition (13) satisfy $\varphi(Z, \mu, n_0) < 1$ for some
$n_0 \ge 1$ (see e.g. [23, p. 121]). Further information on mixing properties of Markov chains can be
found in [40, Chapter 21].
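The exponential decay of mixing coefficients for well-behaved Markov chains can be seen explicitly in the simplest possible case. The sketch below considers a stationary two-state chain with flip probabilities `p` and `q` (hypothetical parameters) and computes the $\alpha$-coefficient between $\sigma(Z_1)$ and $\sigma(Z_{1+n})$ directly from the $n$-step transition matrix. In this toy case the value equals $\pi_0 \pi_1 |1 - p - q|^n$, where $\pi$ is the stationary distribution.

```python
def two_state_alpha(p, q, lag):
    # Transition matrix [[1-p, p], [q, 1-q]] with stationary law (q, p) / (p + q).
    trans = [[1.0 - p, p], [q, 1.0 - q]]
    stationary = [q / (p + q), p / (p + q)]
    power = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(lag):
        power = [[sum(power[i][k] * trans[k][j] for k in range(2))
                  for j in range(2)] for i in range(2)]
    # joint[i][j] = P(Z_1 = i, Z_{1+lag} = j) under the stationary start.
    joint = [[stationary[i] * power[i][j] for j in range(2)] for i in range(2)]
    # On a two-point space every nontrivial event is a singleton, so the
    # supremum in the alpha-coefficient runs over four pairs of atoms.
    return max(abs(joint[i][j] - stationary[i] * stationary[j])
               for i in range(2) for j in range(2))


# Coefficients for lags 1..7 with p = 0.3, q = 0.2 (geometric ratio 0.5).
alphas = [two_state_alpha(0.3, 0.2, lag) for lag in range(1, 8)]
```

With `p = 0.3` and `q = 0.2` successive entries of `alphas` halve, since $1 - p - q = 0.5$. The sketch only covers the coefficient between two single time points, not the coefficient between the whole past and the whole future.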

Now let $(\mathcal{Z}, \mathcal{B}, \mu)$ be a probability space and $Z := (T^{i-1})_{i \ge 1}$ be an invariant dynamical system
on $\mathcal{Z}$. For $i > j \ge 1$ we then have $\sigma(Z_i) = \sigma(T^{i-1}) \subset \sigma(T^{j-1}) = \sigma(Z_j)$ and hence we obtain

$$\alpha(Z, \mu, i, j) \ge \sup_{A \in \sigma(Z_i)} \bigl|\mu(A \cap A) - \mu(A)\mu(A)\bigr| = \sup_{B \in \mathcal{B}} \mu(B)\bigl(1 - \mu(B)\bigr).$$

Consequently, $Z$ is not weakly $\alpha$-bi-mixing if $\mathcal{B}$ is not $\mu$-trivial. However, note that images of
dynamical systems can even be strongly $\alpha$-mixing. Indeed, every i.i.d. sequence is the image of an
invariant dynamical system and the independence implies that all $\alpha$-coefficients are equal to 0. For
more information on ergodic mixing and its relation to $\xi$-mixing we refer to [39, Chapter 22] and
[23].

Let us finally discuss some laws of large numbers for mixing processes. We begin with the following
simple result which shows that asymptotically mean stationary, weakly bi-mixing processes satisfy
the WLLNE:

Proposition 3.2 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $\mathcal{Z}$ be a measurable space, and $Z := (Z_i)_{i \ge 1}$
be a $\mathcal{Z}$-valued stochastic process on $\Omega$ which is weakly $\alpha$-bi-mixing with respect to $\mu$. Then the
following statements are equivalent:

i) $Z$ is AMS.

ii) $Z$ satisfies the WLLNE.

For the quite simple proof of this proposition we refer to Section 4. Moreover, using [19,
Thm. 8.2.1] it is easy to see that for $\alpha$-mixing processes AMS is actually equivalent to SLLNE.
Finally, [41, Cor. 8.2.2] shows that identically distributed processes $Z$ with

$$\sum_{n=1}^{\infty} \varphi^{1/2}(Z, \mu, 2^n) < \infty \tag{27}$$

satisfy the SLLN. Note that in the above summability condition only a "few" $\varphi$-coefficients are
considered. In particular, (27) is satisfied whenever there are constants $c > 0$ and $\alpha > 2$ with
$\varphi(Z, \mu, n) \le c (\ln n)^{-\alpha}$ for all $n \ge 2$.

3.2 Consistency of SVMs for Mixing Processes

In this subsection we establish consistency results for data-generating processes with known upper
bounds on the weakly $\alpha$-bi-mixing rate. Unlike in the case of general processes satisfying a law of
large numbers these new consistency results give explicit conditions on the regularization sequences
guaranteeing consistency.

In order to formulate these results we have to introduce a new quantity. To this end let $k$ be a
bounded kernel over some set $X$. Then the supremum norm of $k$ is defined by

$$\|k\|_\infty := \sup_{x \in X} \sqrt{k(x, x)}.$$

Note that the boundedness of $k$ implies $\|k\|_\infty < \infty$. Moreover, for the Gaussian kernels $k_\sigma$ we have
$\|k_\sigma\|_\infty = 1$.
Now we can present our first consistency result which deals with locally Lipschitz-continuous loss 
functions: 

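Theorem 3.3 Let $X$ be a separable metric space, $Y \subset \mathbb{R}$ be a closed subset and $L : X \times Y \times \mathbb{R} \to
[0, \infty)$ be a convex, locally Lipschitz continuous loss function with $\|L(\cdot, \cdot, 0)\|_\infty \le c$. Moreover, let
$(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z := ((X_i, Y_i))_{i \ge 1}$ be an $X \times Y$-valued, AMS stochastic process on
$\Omega$, and $P$ be the asymptotic mean of $(Z, \mu)$. In addition, let $H$ be an $(L, P)$-rich RKHS over $X$
with bounded continuous kernel $k$. We write

$$B_\lambda := \|k\|_\infty \Bigl(\frac{c}{\lambda}\Bigr)^{1/2}, \qquad \lambda > 0.$$

Finally, assume that there are constants $C \in (0, \infty)$ and $\alpha \in (0, 1]$ with

$$\Bigl|\mathbb{E}_P f - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu f \circ Z_i\Bigr| \le C \|f\|_\infty n^{-\alpha} \tag{28}$$

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \alpha(Z, \mu, i, j) \le C n^{-\alpha} \tag{29}$$

for all $f \in \mathcal{L}_\infty(\mathcal{Z})$ and all $n \ge 1$. Then for all null-sequences $(\lambda_n)$ of strictly positive real numbers
with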

— #^ ^ (30) 

the corresponding $(\lambda_n)$-SVM based on $H$ and $L$ is $L$-consistent for $Z$.

The above result is of particular interest for binary classification problems. Indeed, recall that 
the standard SVM for classification uses the hinge loss defined by 

L{y, t) := max{0, 1 - yt} , y £ Y := {-1, 1}, t £ R. 

Obviously, this loss function is convex and Lipschitz continuous with $|L|_1 = 1$ and $L(y, 0) = 1$
for $y \in Y$. For $X := \mathbb{R}^d$ and $H_\sigma$ being the RKHS of a Gaussian RBF kernel with fixed width
$\sigma$ we consequently obtain $L$-consistency for the corresponding $(\lambda_n)$-SVM whenever $\lambda_n \to 0$ and
$\lambda_n^2 n^{\alpha} \to \infty$, where $\alpha$ is the exponent satisfying (28) and (29). Since $L$-consistency implies binary
classification consistency (see e.g. [SiHS]) we hence see that the above SVM is classification consis- 
tent. In particular, this consistency generalizes earlier consistency results of [HOIS] with respect to 
both the compactness assumption on X and the i.i.d. assumption on the data-generating process. 
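The properties of the hinge loss used in this argument (value 1 at $t = 0$, Lipschitz constant 1 in $t$, and convexity in $t$) are easy to check numerically. The following sketch does so on randomly drawn arguments. The sampling range and the number of trials are arbitrary choices made for illustration.

```python
import random


def hinge_loss(y, t):
    # L(y, t) = max{0, 1 - y t} for labels y in {-1, +1} and real t.
    return max(0.0, 1.0 - y * t)


random.seed(1)
lipschitz_ok = True
convex_ok = True
for _ in range(1000):
    y = random.choice([-1, 1])
    t1 = random.uniform(-5.0, 5.0)
    t2 = random.uniform(-5.0, 5.0)
    # Lipschitz continuity in t with constant 1.
    if abs(hinge_loss(y, t1) - hinge_loss(y, t2)) > abs(t1 - t2) + 1e-12:
        lipschitz_ok = False
    # Midpoint convexity in t.
    mid = hinge_loss(y, 0.5 * (t1 + t2))
    if mid > 0.5 * (hinge_loss(y, t1) + hinge_loss(y, t2)) + 1e-12:
        convex_ok = False
```

A random check of this kind is of course no proof, but it mirrors the two structural assumptions (convexity and Lipschitz continuity) that the theorem places on the loss.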

In the case of $\alpha = 1$ the SVM using the hinge loss $L$ and an $(L, P)$-rich RKHS is consistent if
$\lambda_n \to 0$ and $n \lambda_n^2 \to \infty$. Since this is exactly the condition ensuring consistency in the i.i.d. case we
see that such an SVM is quite robust against violations of the i.i.d. assumption.

If quantitative approximation properties of $H$ in terms of convergence rates for $\mathcal{R}_{L,P}(f_{P,\lambda}) \to
\mathcal{R}^*_{L,P}$ are known, the proof of Theorem 3.3 also provides learning rates. However, we conjecture that
these rates are usually overly conservative in terms of the confidence since we only employ Markov's
inequality. Therefore we do not discuss these convergence rates in further detail. Instead we would
like to compare our consistency result with the consistency result for regularized boosting algorithms
derived in [37]. To this end we first observe that for (in the wide sense) stationary processes (28)
is automatically satisfied and (29) is equivalent to

$$\frac{1}{n} \sum_{i=1}^{n} \alpha(Z, \mu, i) \le C n^{-\alpha}, \qquad n \ge 1,$$

by (26). Obviously, the latter is satisfied if $Z$ is algebraically $\alpha$-mixing with exponent $\alpha$, i.e. if it
satisfies $\alpha(Z, \mu, n) \le C n^{-\alpha}$ for all $n \ge 1$. Consequently, Theorem 3.3 implies consistency results for
stationary, algebraically $\alpha$-mixing processes with known lower bound on the mixing rate. Compared
to this [37] only establishes a consistency result for stationary, algebraically $\beta$-mixing processes with
known lower bound on the mixing rate. Since in general $\alpha$-mixing is a strictly weaker assumption
than $\beta$-mixing we see that Theorem 3.3 substantially weakens the assumptions of [37]. Finally, note
that our restriction to polynomial rates in (28) and (29) is by no means necessary. For example, if
we replace $n^{-\alpha}$ by $(\log n)^{-\alpha}$ in (28) and (29) then the corresponding condition on $(\lambda_n)$ for the SVM
using the hinge loss becomes $\lambda_n^2 (\log n)^{\alpha} \to \infty$. In particular, note that such an SVM is consistent
for all stationary, algebraically $\alpha$-mixing processes.^ In this direction it is interesting to recall that
in [12] consistency was established for kernel estimators and algebraically $\alpha$-mixing, not necessarily
stationary processes. To our best knowledge this is the consistency result that is closest in its
assumptions on $Z$ to Theorem 3.3.
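The passage from a polynomial bound on the individual coefficients to the same polynomial rate for their average can be made explicit for exponents strictly between 0 and 1, where a comparison with an integral gives $\frac{1}{n}\sum_{k=1}^{n} k^{-a} \le n^{-a}/(1-a)$. The sketch below checks this numerically. The restriction to exponents in $(0, 1)$ and the name `exponent` for the decay rate are assumptions of the sketch.

```python
def average_coefficient(exponent, n, const=1.0):
    # (1/n) * sum_{k=1}^{n} const * k^(-exponent): average of algebraically
    # decaying coefficients.
    return sum(const * k ** (-exponent) for k in range(1, n + 1)) / n


def algebraic_bound(exponent, n, const=1.0):
    # Integral comparison: sum_{k<=n} k^(-a) <= n^(1-a) / (1-a) for 0 < a < 1.
    return const * n ** (-exponent) / (1.0 - exponent)


# Ratio of average to n^(-exponent); it stays below 1 / (1 - exponent).
ratios = [average_coefficient(0.5, n) * n ** 0.5 for n in (10, 100, 1000)]
```

For `exponent = 0.5` the ratios stay below 2, so the averaged coefficients decay at the same polynomial rate as the coefficients themselves, with a larger constant.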

The proof of Theorem 3.3 is based on a stability argument together with a simple Markov-type
concentration inequality for Hilbert space valued random variables. In principle, one could
also employ exponential type inequalities for sums of R-valued random variables in the sense of 
e.g. \n\ Chapter 1.4] together with a skeleton argument based on e.g. covering numbers. However, 
our preliminary considerations showed that the resulting conditions on $(\lambda_n)$ were substantially
stronger, and hence we do not discuss this approach in further detail. 

The next theorem establishes a result similar to Theorem 3.3 for distance-based loss functions of
some growth type p: 

Theorem 3.4 Let $L : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ be a convex distance-based loss function of upper growth type
$p \in [1, 2]$. Furthermore, let $X$ be a separable metric space and $H$ be an $(L, P)$-rich RKHS over $X$
with bounded continuous kernel $k$. Moreover, let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z := ((X_i, Y_i))_{i \ge 1}$
be an $X \times \mathbb{R}$-valued, AMS stochastic process on $\Omega$, and $P$ be the asymptotic mean of $(Z, \mu)$. Assume
that we have

for some $q \in [p, \infty]$, where $|\cdot|_q$ is defined by (18). Furthermore assume that there are constants
$C > 0$ and $\alpha, \beta \in (0, 1]$ with

$$\Bigl|\mathbb{E}_P f - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu f \circ Z_i\Bigr| \le C \|f\|_{L_1(P)} n^{-\alpha} \tag{32}$$

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \alpha^{1 - \frac{2p-2}{q}}(Z, \mu, i, j)\, \varphi_{\mathrm{sym}}^{\frac{2p-2}{q}}(Z, \mu, i, j) \le C n^{-\beta} \tag{33}$$

^However, for such $(\lambda_n)$ the SVM typically deals too conservatively with the stochastic part of the learning
process, so that the approximation behaviour is poor. As a consequence this result does not seem to have any
practical relevance.

for all $f \in L_1(P) \cap \bigcap_{i \ge 1} L_1(\mu_{(X_i, Y_i)})$. Then for all null-sequences $(\lambda_n)$ of strictly positive real
numbers with 

A^n^" ^ oo (34) 

A^Pn^ ^ oo (35) 

the corresponding $(\lambda_n)$-SVM based on $H$ and $L$ is $L$-consistent for $Z$.

Since distance-based loss functions are typically used for regression problems we see that the
above theorem is mainly interesting for these learning scenarios. For Lipschitz continuous losses
such as the absolute distance loss $L(y, t) := |y - t|$, the $\epsilon$-insensitive loss $L(y, t) := \max\{0, |y - t| - \epsilon\}$,
the logistic loss or Huber's robust loss we obviously have $p = 1$ and hence (33) reduces to (29).
Moreover, for Lipschitz continuous losses we can choose $q = 1$ in (31). Consequently, it is easy to
see that all remarks made for the classification SVM using the hinge loss remain true for regression
SVMs using one of the above losses.
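The distinction between growth type $p = 1$ and $p = 2$ can be illustrated by writing each distance-based loss as a function $\psi$ of the residual $r = y - t$ and estimating its largest slope on growing intervals. In the sketch below the concrete parametrizations (the threshold of Huber's loss, the normalization of the logistic loss for regression, the value of $\epsilon$) are common textbook choices and are assumptions, not taken from the paper.

```python
import math


def absolute_loss(r):
    return abs(r)


def eps_insensitive_loss(r, eps=0.1):
    return max(0.0, abs(r) - eps)


def huber_loss(r, c=1.0):
    # Quadratic near zero, linear with slope c in the tails.
    return 0.5 * r * r if abs(r) <= c else c * abs(r) - 0.5 * c * c


def logistic_loss(r):
    # -ln(4 e^r / (1 + e^r)^2), written in a numerically stable form.
    return -math.log(4.0) - r + 2.0 * math.log1p(math.exp(r))


def least_squares_loss(r):
    return r * r


def max_slope(psi, radius, steps=2000):
    # Largest absolute difference quotient of psi on a grid over [-radius, radius].
    grid = [-radius + 2.0 * radius * i / steps for i in range(steps + 1)]
    return max(abs(psi(b) - psi(a)) / (b - a) for a, b in zip(grid, grid[1:]))


lipschitz_losses = [absolute_loss, eps_insensitive_loss, huber_loss, logistic_loss]
```

For the four Lipschitz losses `max_slope` stays at most 1 regardless of `radius`, whereas for `least_squares_loss` it grows roughly like twice the radius, which is the numerical face of growth type 2.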

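In contrast to this the least squares SVM which uses the standard least squares loss $L(y, t) :=
(y - t)^2$ requires $p = 2$ in the above theorem. For processes with uniformly bounded noise, i.e. $q = \infty$,
we again see that (33) reduces to (29). Moreover, for $q \in (2, \infty)$ we have

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \alpha^{1 - \frac{2}{q}}(Z, \mu, i, j)\, \varphi_{\mathrm{sym}}^{\frac{2}{q}}(Z, \mu, i, j) \le \Bigl(\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \alpha(Z, \mu, i, j)\Bigr)^{1 - \frac{2}{q}},$$

so that (29) implies (33) for $\beta := \alpha(1 - 2/q)$. However, for $q = 2$ we have $1 - 2/q = 0$, and
consequently we only obtain consistency results for weakly $\varphi_{\mathrm{sym}}$-bi-mixing processes.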

Theorem 3.4 generalizes the only known consistency result (see [3]) for regression SVMs dealing
with unbounded noise with respect to both the compactness assumption on $X$ and the i.i.d. assumption
on the data-generating process. In particular, Theorem 3.4 shows that such SVMs are rather
robust against violations of these assumptions, and consequently it gives a strong justification for
using such SVMs in rather general situations.

Finally, we would like to mention that condition (31) can be replaced by a weaker assumption describing
the average behaviour of the sequence $(|\mu_{(X_i, Y_i)}|_q)_{i \ge 1}$. However, the resulting conditions on $(\lambda_n)$
are more complicated and hence we omit the details.

4 Proofs 

4.1 Proofs from Subsection 2.1

Proof of LemmalMM- Let B be the a-algebra of Z. We write P„(5) := ^ J27=i K^i G B) for 
$B \in \mathcal{B}$ and $n \ge 1$. Then $P_n$ is obviously a probability measure on $\mathcal{B}$ for all $n \ge 1$. Now the theorem
of Vitali-Hahn-Saks (see e.g. [43, p. 158-160]) ensures that $P(B) := \lim_{n \to \infty} P_n(B)$, $B \in \mathcal{B}$, defines
a probability measure on $\mathcal{B}$. ■



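Proof of Theorem 2.4: Recall that the convergence in probability $\mu$ can be described by the
metric

$$d(f, g) := \int_\Omega \min\{1, |f - g|\} \, d\mu, \qquad f, g \in \mathcal{L}_0(\Omega).$$

Moreover, for measurable $B \subset \mathcal{Z}$ let $c_B$ be the constant satisfying (1). The WLLNE and the above
metric then show

$$\lim_{n \to \infty} \int_\Omega \Bigl|\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_B \circ Z_i - c_B\Bigr| \, d\mu = 0.$$

Since $\mathbb{E}_\mu$ is continuous on $L_1(\mu)$ we hence find

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B \circ Z_i = \lim_{n \to \infty} \int_\Omega \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_B \circ Z_i \, d\mu = \int_\Omega \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_B \circ Z_i \, d\mu = \mathbb{E}_\mu c_B = c_B,$$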

where the existence of the right limit implies the existence of the left limit. Consequently, Z is AMS 
and we have P{B) = cb- Obviously, the latter together with ([1]) immediately gives dH). Finally, if 
Z satisfies the SLLNE then we obtain the almost sure convergence in (jl]) from ([2]). ■ 

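Proof of Lemma 2.5: Let us begin by showing the assertion for the strong law. To this end we
fix an $\varepsilon > 0$. By the approximation lemma for bounded measurable functions there exists a step
function $g : \mathcal{Z} \to \mathbb{R}$ with $\|f - g\|_\infty \le \varepsilon$. Now, the linearity of the limit shows

$$\mathbb{E}_P g = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g \circ Z_i(\omega)$$

for $\mu$-almost all $\omega \in \Omega$ and consequently, [44, Lemma 20.6] gives an $n_0 \ge 1$ with

$$\mu\Bigl(\sup_{n \ge n_0} \Bigl|\frac{1}{n} \sum_{i=1}^{n} g \circ Z_i - \mathbb{E}_P g\Bigr| \le \varepsilon\Bigr) \ge 1 - \varepsilon. \tag{36}$$

Moreover, for $\omega \in \Omega$ we have

$$\sup_{n \ge n_0} \Bigl|\frac{1}{n} \sum_{i=1}^{n} f \circ Z_i(\omega) - \mathbb{E}_P f\Bigr|
\le \sup_{n \ge n_0} \Bigl( \Bigl|\frac{1}{n} \sum_{i=1}^{n} f \circ Z_i(\omega) - \frac{1}{n} \sum_{i=1}^{n} g \circ Z_i(\omega)\Bigr| + \Bigl|\frac{1}{n} \sum_{i=1}^{n} g \circ Z_i(\omega) - \mathbb{E}_P g\Bigr| + |\mathbb{E}_P g - \mathbb{E}_P f| \Bigr)
\le 2\varepsilon + \sup_{n \ge n_0} \Bigl|\frac{1}{n} \sum_{i=1}^{n} g \circ Z_i(\omega) - \mathbb{E}_P g\Bigr|,$$

and hence we obtain

$$\mu\Bigl(\sup_{n \ge n_0} \Bigl|\frac{1}{n} \sum_{i=1}^{n} f \circ Z_i - \mathbb{E}_P f\Bigr| \le 3\varepsilon\Bigr) \ge 1 - \varepsilon.$$

This shows the $\mu$-almost sure convergence in (5). Using that the functions $\frac{1}{n} \sum_{i=1}^{n} f \circ Z_i$, $n \ge 1$,
are uniformly bounded, Lebesgue's theorem then yields

$$\mathbb{E}_P f = \int_\Omega \mathbb{E}_P f \, d\mu = \int_\Omega \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f \circ Z_i \, d\mu = \lim_{n \to \infty} \int_\Omega \frac{1}{n} \sum_{i=1}^{n} f \circ Z_i \, d\mu = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu f \circ Z_i,$$

and hence we have found (6). Finally, if $Z$ satisfies the WLLNE then pulling the supremum out of
$\mu$ in (36) and adjusting the rest of the proof accordingly shows (5) with convergence in probability
$\mu$. Moreover, in this case (6) can be shown analogously to the argument used in the proof of Theorem
2.4. ■

4.2 Proofs from Subsection 2.2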



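Proof of Proposition 2.7: ii) $\Rightarrow$ i). Follows from Theorem 2.4.

i) $\Rightarrow$ ii). Let $P$ be the stationary mean of $(Z, \mu)$. Then there exists an $n_0 \ge 1$ such that

$$\Bigl|\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B \circ Z_i - P(B)\Bigr| < \frac{\varepsilon}{2}, \qquad n \ge n_0.$$

For $n \ge n_0$ Markov's inequality then yields

$$\mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) - P(B)\Bigr| > \varepsilon\Bigr\}\Bigr)
\le \mu\Bigl(\Bigl\{\omega \in \Omega : \Bigl|\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_B \circ Z_i(\omega) - \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_\mu \mathbf{1}_B \circ Z_i\Bigr| > \frac{\varepsilon}{2}\Bigr\}\Bigr)
\le 4 \varepsilon^{-2} n^{-2} \, \mathbb{E}_\mu \Bigl(\sum_{i=1}^{n} \bigl(\mathbf{1}_B \circ Z_i - \mathbb{E}_\mu \mathbf{1}_B \circ Z_i\bigr)\Bigr)^2.$$

Let us write $h_i := \mathbf{1}_B \circ Z_i - \mathbb{E}_\mu \mathbf{1}_B \circ Z_i$, $i \ge 1$. Then we have $\mathbb{E}_\mu h_i = 0$ and $h_i(\omega) \in [-1, 1]$ for all
$i \ge 1$ and all $\omega \in \Omega$. Moreover, for $i \ne j$ we have $\mathbb{E}_\mu h_i h_j = 0$ since we assume that $\mathbf{1}_B \circ Z_i$ and
$\mathbf{1}_B \circ Z_j$ are uncorrelated. Consequently, we obtain

$$\mathbb{E}_\mu \Bigl(\sum_{i=1}^{n} \bigl(\mathbf{1}_B \circ Z_i - \mathbb{E}_\mu \mathbf{1}_B \circ Z_i\bigr)\Bigr)^2 \le n,$$

from which we easily obtain the assertion. ■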

Proof of Proposition rOL - Let us define Y := f o Zi and X„ := ^ X^ILi / o ^i, > 1- Then (HUD 
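states $\mathbb{E}(X_{n-1} \mid \mathcal{F}_n) = X_n$ for all $n \ge 2$, and hence we obtain

$$X_n = \mathbb{E}(X_{n-1} \mid \mathcal{F}_n) = \mathbb{E}\bigl(\mathbb{E}(X_{n-2} \mid \mathcal{F}_{n-1}) \mid \mathcal{F}_n\bigr) = \mathbb{E}(X_{n-2} \mid \mathcal{F}_n) = \dots = \mathbb{E}(X_1 \mid \mathcal{F}_n) = \mathbb{E}(Y \mid \mathcal{F}_n)$$

for all $n \ge 2$. Moreover, $X_1$ is $\mathcal{F}_1$-measurable and hence we also have $X_1 = \mathbb{E}(X_1 \mid \mathcal{F}_1) = \mathbb{E}(Y \mid \mathcal{F}_1)$.
Now, [45, Theorem 6.6.3] shows that $\lim_{n \to \infty} X_n = \mathbb{E} Y$ almost surely. Furthermore, from $X_n =
\mathbb{E}(Y \mid \mathcal{F}_n)$, $n \ge 1$, we also conclude $\mathbb{E}_\mu X_n = \mathbb{E}_\mu Y = \mathbb{E}_\mu f \circ Z_1$, and hence $\mu_{Z_1}$ is the asymptotic mean
of $(Z, \mu)$. Combining these results then gives the assertion. ■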



Proof of Proposition \2.11]; Without loss of generality we may assume that £ is of canonical 
form, i.e. = vri o S'''~^, i > 1, where tti : (M*^)^ M.'^ is the first coordinate projection, S is 
the shift operator on (M'^)^, and u is a product measure, i.e. ly = (/^')^ ^ suitable measure /x' 
on M*^. Then S := {S^^^)i>i is weakly mixing, and consequently [241 p. 65] shows that Z x S 
is /X (8) z^-ergodic. By Theorem 12.101 we can then conclude that Z x S satisfies the /x z^-SLLN. 
Moreover, we have T^~^ _|_ = j-n-i _|_ y^-^ o S^~^ and hence + £^ is an image of the process Z x S. 
From this we easily conclude that Z + 8 satisfies the /i z^-SLLN. ■ 

4.3 Proofs from Subsection 2.3

Before we prove Lemma 2.18 we first have to recall the following elementary lemma whose proof is
omitted:

Lemma 4.1 Let $(a_i)$ be a sequence of real numbers and $a \in \mathbb{R}$ such that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} a_i = a.$$

Then for all $n_0 \ge 0$ we have

$$\lim_{n \to \infty} \frac{1}{n - n_0} \sum_{i = n_0 + 1}^{n} a_i = a.$$
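The content of Lemma 4.1 is that discarding a fixed number of initial terms does not change a Cesàro limit. A short numerical sketch illustrates this. The sequence used below is a hypothetical example that itself does not converge but has Cesàro limit 2.

```python
def cesaro_mean(seq, n, n0=0):
    # Average of seq(i) over i = n0 + 1, ..., n, normalized by n - n0.
    return sum(seq(i) for i in range(n0 + 1, n + 1)) / (n - n0)


def example_sequence(i):
    # Oscillates between roughly 1 and 3, plus a vanishing perturbation.
    return 2.0 + (-1) ** i + 5.0 / i


full_mean = cesaro_mean(example_sequence, 200000)
tail_mean = cesaro_mean(example_sequence, 200000, n0=1000)
```

Both `full_mean` and `tail_mean` are close to 2, in line with the lemma. This is what allows the proof of Lemma 2.18 to reduce to the case $n_0 = 0$.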



Proof of Lemma 2.18: Let us first assume that $L$ is locally bounded. By Lemma 4.1 it then
suffices to consider the case $n_0 = 0$. Now observe that the function $g(x, y) := L(x, y, f(x))$,
$(x, y) \in X \times Y$, is a bounded, measurable function since $f$ is assumed to be bounded, and $L$ is
locally bounded. Applying Lemma 2.5 to the function $g$ then gives the assertion.
Let us now assume that $L$ is a $P$-integrable Nemitski loss. Then there exists a $b \in L_1(P)$ and an
increasing function $h : [0, \infty) \to [0, \infty)$ with

$$g(x, y) \le b(x, y) + h(\|f\|_\infty), \qquad (x, y) \in X \times Y.$$

This shows $g \in L_1(P)$, and hence the assertion follows from Definition 2.6. ■

4.4 Proofs from Section 2.4

For the proof of Theorem 2.20 we need some preparations. Let us begin with the following result on
the existence and uniqueness of infinite sample SVMs which is a slight extension of similar results 
established in [IBl 0] : 

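Theorem 4.2 Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a convex loss function and $P$ be a distribution on
$X \times Y$ such that $L$ is a $P$-integrable Nemitski loss. Furthermore, let $H$ be a RKHS of a bounded
measurable kernel over $X$. Then for all $\lambda > 0$ there exists exactly one element $f_{P,\lambda} \in H$ such that

$$\lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) = \inf_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f). \tag{37}$$

Furthermore, we have $\|f_{P,\lambda}\|_H \le \sqrt{\mathcal{R}_{L,P}(0) / \lambda}$.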



The following two results describe the stability of the empirical SVM solutions. The first result
was (essentially) shown in [46^ H]:

Theorem 4.3 Let $X$ be a separable metric space, $L : X \times Y \times \mathbb{R} \to [0,\infty)$ be a convex, locally
Lipschitz continuous loss function, and $P$ be a distribution on $X \times Y$ with $\mathcal{R}_{L,P}(0) < \infty$. Furthermore,
let $H$ be the RKHS of a bounded, continuous kernel $k$ over $X$ with canonical feature map
$\Phi : X \to H$. We define

\[ B_\lambda := \|k\|_\infty \sqrt{\mathcal{R}_{L,P}(0)/\lambda}. \]

Then for all $\lambda > 0$ there exists a bounded, measurable function $h_\lambda : X \times Y \to \mathbb{R}$ with

\[ \|h_\lambda\|_\infty \le |L|_{B_\lambda,1} \tag{38} \]

and

\[ \|f_{P,\lambda} - f_{T,\lambda}\|_H \le \frac{1}{\lambda} \bigl\| \mathbb{E}_P h_\lambda \Phi - \mathbb{E}_T h_\lambda \Phi \bigr\|_H \tag{39} \]

for all training sets $T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$, where $\mathbb{E}_T$ denotes the expectation
operator with respect to the empirical measure associated to $T$, i.e. $\mathbb{E}_T g := \frac{1}{n}\sum_{i=1}^{n} g(x_i,y_i)$.
Recall that convex distance-based loss functions are in general not locally Lipschitz continuous.
Nevertheless, SVMs using these losses still enjoy stability, as the following result shows:

Theorem 4.4 Let $X$ be a separable metric space, $L : \mathbb{R} \times \mathbb{R} \to [0,\infty)$ be a convex, distance-based
loss function of upper growth type $p \ge 1$ and $P$ a distribution on $X \times \mathbb{R}$ with $|P|_q < \infty$ for some
$q \in [p,\infty]$. Furthermore, let $H$ be a RKHS of a bounded, continuous kernel over $X$ with canonical
feature map $\Phi : X \to H$. Then there exists a constant $c_L > 0$ depending only on $L$ such that for
all $\lambda > 0$ there exists a measurable function $h_\lambda : X \times Y \to \mathbb{R}$ with

\[ \|h_\lambda\|_{L_s(P)} \le 8^p c_L \bigl(1 + |P|_q^{p-1} + \|f_{P,\lambda}\|_\infty^{p-1}\bigr) \tag{40} \]

\[ \|f_{P,\lambda} - f_{T,\lambda}\|_H \le \frac{1}{\lambda} \bigl\| \mathbb{E}_P h_\lambda \Phi - \mathbb{E}_T h_\lambda \Phi \bigr\|_H \tag{41} \]

for $s := q/(p-1)$, all distributions $P$ on $X \times \mathbb{R}$ with $|P|_q < \infty$ and all training sets $T \in (X \times Y)^n$.

Finally, if L is also of lower growth type p then we additionally have 

Whxh.iP) < WcL{l + \P\^q-'){l + \\fp,x\\T). (42) 

Proof: By taking care of the constants in the proof of [H Theorem 13] we obtain a measurable
function $h_\lambda : X \times Y \to \mathbb{R}$ satisfying (41) and

\[ |h_\lambda(x,y)| \le 4^p c_L \max\bigl\{1, |y - f_{P,\lambda}(x)|^{p-1}\bigr\}, \qquad (x,y) \in X \times Y, \]

where $c_L$ is a suitable constant depending only on the loss function $L$. For $q = \infty$ we then easily
find the assertion, and hence let us assume $q \in [p,\infty)$. In this case, the above inequality yields

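\[ |h_\lambda(x,y)|^s \le 4^{ps} c_L^s \max\bigl\{1, |y - f_{P,\lambda}(x)|^{q}\bigr\} \le 4^{ps} 2^{q-1} c_L^s \bigl(1 + |y|^q + |f_{P,\lambda}(x)|^q\bigr). \tag{43} \]

Since $(q-1)/s \le p$ and $s \ge 1$ we then obtain (40). Moreover, if $\psi$ is the function satisfying $L(y,t) =
\psi(y - t)$, $y,t \in \mathbb{R}$, we have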

Ep|/p,Ar < 2^^-^ / \y-fpAx)\'' + \y\PdP{x,y) 

JXxY 

< / c^^^i^{y-fp,xix))+l + \y\PdPix,y) 

= 2P-'[ci^nL,p{fp,x) + ^ + \P\^) 

< 2^-l(4^)7^L,p(0) + l + |P|^) 

< 2^^-1(0^ (1 + IPI^) + 1 + IPI^) 

< 2*^0^(1 + 1^1^), 

where c^^\ Sj^ > 1, and > 1 are suitable constants depending only on the loss function L. 
Combining the estimate on Ep|/p.a|^ with (j43p then gives 



/.aIIl.cp) < 4P2^cz.(i + |P|?-i + ||/p,a||oJ' (lEp|/p,Ar)" 



i-p 



< 4P2^cz.(i + |P|?-i + ||/p,a||oJ' (2p4^^(1 + |P|P) 



(3), 



< 4^2^ (ci^y^^ (1 + + |p|^-i) (1 + ii/p,Aii:r) 



23 



where c'^ > 1 is another suitable constant depending only on the loss function L. Now note that 
we have ^±2 = (^E _|_ — 1) < 2{p — 1) and 1 + 7 < 2. These estimates together with 



\P\l < \P\ 



I PI 



< 1 + 



then yield (|i2]) . 



The next lemma establishes Hilbert space valued laws of large numbers which are later used to
bound the term $\|\mathbb{E}_P h_\lambda \Phi - \mathbb{E}_T h_\lambda \Phi\|_H$.

Lemma 4.5 Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $Z$ be a Polish space, and $\mathcal{Z} := (Z_i)_{i \ge 1}$ be a $Z$-valued
stochastic process on $\Omega$. Assume that $\mathcal{Z}$ satisfies the WLLNE and let $P$ be the asymptotic
mean of $(\mathcal{Z}, \mu)$. Furthermore, let $H$ be a Hilbert space, and $\Phi : Z \to H$ be a continuous and bounded
map. Then for all $h \in L_\infty(P)$ we have

\[ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i = \mathbb{E}_P h\Phi, \tag{44} \]

where the convergence is in probability $\mu$. Moreover, if $\mathcal{Z}$ actually satisfies the WLLN then (44)
holds for all $h \in L_1(P)$. Finally, the convergence holds $\mu$-almost surely for all $h \in L_\infty(P)$ or
$h \in L_1(P)$ if $\mathcal{Z}$ satisfies the SLLNE or SLLN, respectively.



Proof: Let us first show (44) for $h \in L_1(P)$ when $\mathcal{Z}$ satisfies the SLLN. To this end we first make
the additional assumption that there exists a compact subset $K \subset Z$ with $h(z) = 0$ for all $z \notin K$.
Now recall that $\Phi$ is continuous and hence $\Phi(K) \subset H$ is compact. Moreover, recall that $H$ as a
Hilbert space has the approximation property (see e.g. [47, p. 30ff] for details on this concept). For a
fixed $\varepsilon > 0$ there consequently exists a bounded linear operator $S : H \to H$ with $m := \operatorname{rank} S < \infty$
and

\[ \|S\Phi(z) - \Phi(z)\|_H \le \varepsilon, \qquad z \in K. \]

Let $e_1, \dots, e_m$ be an ONB of the image $SH$ of $H$ under $S$. Since $\langle e_j, S\Phi \rangle : Z \to \mathbb{R}$, $j = 1,\dots,m$,
are bounded measurable functions we then find that

\[ \langle e_j, hS\Phi \rangle = h \langle e_j, S\Phi \rangle, \qquad j = 1,\dots,m, \]

are $P$-integrable. Consequently, they satisfy the limit relation (8), and by a well-known reformulation
of almost sure convergence (see e.g. [33, Lem. 20.6]) there hence exists an $n_\varepsilon \ge 1$ such that with
probability not less than $1 - \varepsilon$ we have both

\[ \sup_{n \ge n_\varepsilon} \; \sup_{j=1,\dots,m} \Bigl| \frac{1}{n} \sum_{i=1}^{n} \langle e_j, hS\Phi \rangle \circ Z_i(\omega) - \mathbb{E}_P \langle e_j, hS\Phi \rangle \Bigr| \le \varepsilon m^{-1/2} \]

and

\[ \sup_{n \ge n_\varepsilon} \Bigl| \frac{1}{n} \sum_{i=1}^{n} |h| \circ Z_i(\omega) - \mathbb{E}_P |h| \Bigr| \le \varepsilon. \]
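Let us fix an $n \ge n_\varepsilon$ and an $\omega \in \Omega$ which satisfies these two inequalities. Using $h(z) = 0$ for all
$z \in Z \setminus K$ we then have

\[
\begin{aligned}
\Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \frac{1}{n} \sum_{i=1}^{n} (hS\Phi) \circ Z_i(\omega) \Bigr\|_H
&\le \frac{1}{n} \sum_{i=1}^{n} |h| \circ Z_i(\omega) \cdot \bigl\| \Phi \circ Z_i(\omega) - S\Phi \circ Z_i(\omega) \bigr\|_H \\
&\le \frac{\varepsilon}{n} \sum_{i=1}^{n} |h| \circ Z_i(\omega) \\
&\le \varepsilon \bigl(\varepsilon + \mathbb{E}_P |h|\bigr) \\
&\le \varepsilon + \varepsilon\, \mathbb{E}_P |h| .
\end{aligned}
\]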



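Moreover, $n$ and $\omega$ also satisfy

\[
\begin{aligned}
\Bigl\| \frac{1}{n} \sum_{i=1}^{n} (hS\Phi) \circ Z_i(\omega) - \mathbb{E}_P hS\Phi \Bigr\|_H
&= \Bigl( \sum_{j=1}^{m} \Bigl| \frac{1}{n} \sum_{i=1}^{n} \langle e_j, hS\Phi \rangle \circ Z_i(\omega) - \mathbb{E}_P \langle e_j, hS\Phi \rangle \Bigr|^2 \Bigr)^{1/2} \\
&\le \sqrt{m} \sup_{j=1,\dots,m} \Bigl| \frac{1}{n} \sum_{i=1}^{n} \langle e_j, hS\Phi \rangle \circ Z_i(\omega) - \mathbb{E}_P \langle e_j, hS\Phi \rangle \Bigr| \\
&\le \varepsilon .
\end{aligned}
\]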

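In addition, $h(z) = 0$ for all $z \in Z \setminus K$ implies

\[ \| \mathbb{E}_P hS\Phi - \mathbb{E}_P h\Phi \|_H \le \int_K |h(z)| \cdot \|S\Phi(z) - \Phi(z)\|_H \, dP(z) \le \varepsilon\, \mathbb{E}_P |h| \]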

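and consequently we can conclude

\[
\begin{aligned}
\Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \mathbb{E}_P h\Phi \Bigr\|_H
&\le \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \frac{1}{n} \sum_{i=1}^{n} (hS\Phi) \circ Z_i(\omega) \Bigr\|_H \\
&\quad + \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (hS\Phi) \circ Z_i(\omega) - \mathbb{E}_P hS\Phi \Bigr\|_H
+ \| \mathbb{E}_P hS\Phi - \mathbb{E}_P h\Phi \|_H \\
&\le 2\varepsilon \bigl(1 + \mathbb{E}_P |h|\bigr) .
\end{aligned}
\]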



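This shows

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{n \ge n_\varepsilon} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \mathbb{E}_P h\Phi \Bigr\|_H \le 2\varepsilon \bigl(1 + \mathbb{E}_P |h|\bigr) \Bigr\} \Bigr) \ge 1 - \varepsilon, \]

and hence [44, Lemma 20.6] yields the assertion for our special case.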

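Let us now prove the assertion for general $h \in L_1(P)$. To this end we may assume without loss of
generality that $\|\Phi(z)\| \le 1$ for all $z \in Z$. Let us fix an $\varepsilon > 0$. Since $Z$ is Polish the measures $P$ and
$|h|P$ are regular and hence there exists a compact subset $K \subset Z$ with

\[ P(Z \setminus K) \le \varepsilon \qquad \text{and} \qquad \int_{Z \setminus K} |h| \, dP \le \varepsilon . \]

Now $g := \mathbf{1}_K h$ is a $P$-integrable function that vanishes outside the compact set $K$. Our preliminary
considerations and the SLLN consequently show that there exists an $n_\varepsilon \ge 1$ such that with
probability not less than $1 - \varepsilon$ we have both

\[ \sup_{n \ge n_\varepsilon} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (g\Phi) \circ Z_i(\omega) - \mathbb{E}_P g\Phi \Bigr\|_H \le \varepsilon \]

and

\[ \sup_{n \ge n_\varepsilon} \Bigl| \frac{1}{n} \sum_{i=1}^{n} (\mathbf{1}_{Z \setminus K} |h|) \circ Z_i(\omega) - \mathbb{E}_P \mathbf{1}_{Z \setminus K} |h| \Bigr| \le \varepsilon . \]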

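Let us fix an $n \ge n_\varepsilon$ and an $\omega \in \Omega$ which satisfies these two inequalities. Using $h - g = \mathbf{1}_{Z \setminus K} h$ and
$\|\Phi(z)\| \le 1$ for all $z \in Z$ we then obtain

\[
\begin{aligned}
\Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \mathbb{E}_P h\Phi \Bigr\|_H
&\le \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (g\Phi) \circ Z_i(\omega) - \mathbb{E}_P g\Phi \Bigr\|_H
+ \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \bigl((h-g)\Phi\bigr) \circ Z_i(\omega) \Bigr\|_H
+ \bigl\| \mathbb{E}_P (h-g)\Phi \bigr\|_H \\
&\le \varepsilon + \frac{1}{n} \sum_{i=1}^{n} (\mathbf{1}_{Z \setminus K} |h|) \circ Z_i(\omega) + \mathbb{E}_P \mathbf{1}_{Z \setminus K} |h| \\
&\le \varepsilon + \mathbb{E}_P \mathbf{1}_{Z \setminus K} |h| + \varepsilon + \mathbb{E}_P \mathbf{1}_{Z \setminus K} |h| \\
&\le 4\varepsilon .
\end{aligned}
\]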



Therefore we obtain

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{n \ge n_\varepsilon} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} (h\Phi) \circ Z_i(\omega) - \mathbb{E}_P h\Phi \Bigr\|_H \le 4\varepsilon \Bigr\} \Bigr) \ge 1 - \varepsilon, \]

and hence we obtain the assertion by another application of [HI Lemma 20.6].
Finally, if $\mathcal{Z}$ only satisfies the WLLN then we obtain the assertion by omitting the terms $\sup_{n \ge n_\varepsilon}$
in the above proof. Moreover, for processes satisfying only a law of large numbers for events we
have to use Lemma 2.5 instead of Definition 2.6. ■
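To make the limit relation (44) more tangible, here is a small numerical sketch in Python. It is only an illustration under assumptions that do not come from the paper: the "process" is the deterministic (hence certainly not i.i.d.) irrational rotation $Z_i = i\varphi \bmod 1$, whose empirical measures converge to the uniform distribution on $[0,1]$, the Hilbert space is $\mathbb{R}^2$, the feature map is $\Phi(z) = (\cos 2\pi z, \sin 2\pi z)$, and $h(z) = z$. For this choice a direct integration gives $\mathbb{E}_P h\Phi = (0, -1/(2\pi))$.

```python
import math

# Hypothetical rotation angle (golden ratio conjugate); any irrational works.
GOLDEN_ROTATION = (math.sqrt(5.0) - 1.0) / 2.0

# E_P[h * Phi] for P uniform on [0, 1], h(z) = z, Phi(z) = (cos 2 pi z, sin 2 pi z).
TARGET = (0.0, -1.0 / (2.0 * math.pi))


def rotation_process(n):
    """First n points Z_i = i * phi mod 1 of the deterministic rotation."""
    return [(i * GOLDEN_ROTATION) % 1.0 for i in range(1, n + 1)]


def phi(z):
    """Bounded, continuous feature map into R^2 (assumed for this example)."""
    angle = 2.0 * math.pi * z
    return (math.cos(angle), math.sin(angle))


def h(z):
    """Bounded weight function on [0, 1] (assumed for this example)."""
    return z


def empirical_mean(zs):
    """(1/n) * sum_i h(Z_i) * Phi(Z_i), computed coordinate-wise."""
    sx = sy = 0.0
    for z in zs:
        px, py = phi(z)
        sx += h(z) * px
        sy += h(z) * py
    return (sx / len(zs), sy / len(zs))


def deviation(n):
    """Euclidean norm of the difference between empirical mean and TARGET."""
    mx, my = empirical_mean(rotation_process(n))
    return math.hypot(mx - TARGET[0], my - TARGET[1])
```

Calling `deviation(n)` for increasing `n` shows the norm in (44) shrinking for this particular sequence; the sketch says nothing about rates for general processes.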

In order to prove Theorem 2.20 we finally need the following technical lemma:

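Lemma 4.6 Let $F : (0,\infty) \times \mathbb{N} \to [0,\infty)$ be a function with $\lim_{n\to\infty} F(\lambda, n) = 0$ for all $\lambda > 0$.
Then there exists a sequence $(\lambda_n) \subset (0,1]$ with

\[ \lim_{n\to\infty} \lambda_n = 0 \qquad \text{and} \qquad \lim_{n\to\infty} F(\lambda_n, n) = 0 . \]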

Proof: For $k \ge 1$ there exists an $n_k \ge 1$ such that for all $n \ge n_k$ we have

\[ F(k^{-1}, n) \le k^{-1} . \tag{45} \]

Obviously, we may assume without loss of generality that $n_k < n_{k+1}$ for all $k \ge 1$. For $n \ge 1$ we
write

\[ \lambda_n := \begin{cases} 1 & \text{if } 1 \le n < n_1, \\ k^{-1} & \text{if } n_k \le n < n_{k+1}. \end{cases} \]

Now let $\varepsilon > 0$. Then there exists an integer $k \ge 1$ with $k^{-1} \le \varepsilon$. Let us fix an $n \ge n_k$. Then there
exists an $i \ge k$ with $n_i \le n < n_{i+1}$, and consequently we have $\lambda_n = i^{-1}$. This gives

\[ \lambda_n = i^{-1} \le k^{-1} \le \varepsilon, \]

and since (45) together with $n_i \le n$ yields $F(i^{-1}, n) \le i^{-1}$ we also find

\[ F(\lambda_n, n) = F(i^{-1}, n) \le i^{-1} \le \varepsilon . \]

These estimates show the assertion. ■
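The construction in the proof of Lemma 4.6 can be sketched in a few lines of Python. This is only an illustration under extra assumptions: the function `example_F` with $F(\lambda,n) = 1/(\lambda^2 n)$ is a hypothetical stand-in, and the search for $n_k$ assumes that $F(\lambda,\cdot)$ is non-increasing in $n$, so that the first index satisfying (45) works for all larger indices as well (the lemma itself does not need this). Exact rational arithmetic is used to avoid rounding effects at the thresholds.

```python
from bisect import bisect_right
from fractions import Fraction


def example_F(lam, n):
    """Hypothetical F(lam, n) = 1 / (lam^2 * n): tends to 0 in n for each
    fixed lam, but blows up as lam -> 0 for fixed n."""
    return 1 / (lam * lam * n)


def threshold_indices(F, k_max, n_limit=10**6):
    """Return [n_1, ..., n_{k_max}] with F(1/k, n_k) <= 1/k and n_k < n_{k+1}.

    Assumes F(lam, .) is non-increasing in n, so that F(1/k, n) <= 1/k
    then also holds for every n >= n_k.
    """
    thresholds = []
    previous = 0
    for k in range(1, k_max + 1):
        lam = Fraction(1, k)
        n = previous + 1  # enforces strictly increasing thresholds
        while F(lam, n) > lam:
            n += 1
            if n > n_limit:
                raise ValueError("no threshold found below n_limit")
        thresholds.append(n)
        previous = n
    return thresholds


def lambda_n(n, thresholds):
    """lambda_n = 1 for n < n_1 and 1/k for n_k <= n < n_{k+1}.

    Only meaningful for n below the (not computed) threshold n_{k_max + 1}.
    """
    k = bisect_right(thresholds, n)  # number of thresholds n_k <= n
    return Fraction(1, k) if k > 0 else Fraction(1)
```

For the hypothetical `example_F` the thresholds come out as $n_k = k^3$, so $\lambda_n$ decays like $n^{-1/3}$: slowly enough that $F(\lambda_n, n) \le \lambda_n$ still tends to zero.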

Proof of Theorem 2.20: We only show the assertion in the case of $\mathcal{Z}$ satisfying the SLLNE.
Since $L$ is locally bounded, the function $L(\cdot,\cdot,0)$ is bounded and hence we may assume without
loss of generality that $\mathcal{R}_{L,Q}(0) \le 1$ for all distributions $Q$ on $X \times Y$. By a standard argument this
assumption leads to

\[ \|f_{Q,\lambda}\|_H \le \lambda^{-1/2} \]

for all distributions $Q$ on $X \times Y$ and all $\lambda > 0$. Moreover, we may assume without loss of generality
that $\|k\|_\infty \le 1$, so that we have $\|f\|_\infty \le \|f\|_H$ for all $f \in H$. Now, let us fix an $\varepsilon > 0$. Since a
simple argument shows that $\lim_{\lambda \to 0} \mathcal{R}_{L,P}(f_{P,\lambda}) = \mathcal{R}^*_{L,P,H} = \mathcal{R}^*_{L,P}$ we then find



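\[
\begin{aligned}
\bigl| \mathcal{R}_{L,P}(f_{T_n(\omega),\lambda}) - \mathcal{R}^*_{L,P} \bigr|
&\le \bigl| \mathcal{R}_{L,P}(f_{T_n(\omega),\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}) \bigr| + \bigl| \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P} \bigr| \\
&\le |L|_{\lambda^{-1/2},1} \, \| f_{T_n(\omega),\lambda} - f_{P,\lambda} \|_\infty + \varepsilon \\
&\le \frac{|L|_{\lambda^{-1/2},1}}{\lambda} \, \bigl\| \mathbb{E}_{T_n(\omega)} h_\lambda \Phi - \mathbb{E}_P h_\lambda \Phi \bigr\|_H + \varepsilon
\end{aligned}
\]

for all $n \ge 1$, $\omega \in \Omega$, and all sufficiently small $\lambda > 0$, where $h_\lambda : X \times Y \to \mathbb{R}$ is the function
according to Theorem 4.3 and $\mathbb{E}_{T_n(\omega)}$ denotes the expectation operator with respect to the empirical
distribution associated to the training set $T_n(\omega) = ((X_1(\omega),Y_1(\omega)), \dots, (X_n(\omega),Y_n(\omega)))$,
i.e. $\mathbb{E}_{T_n(\omega)} g = \frac{1}{n} \sum_{i=1}^{n} g(X_i(\omega), Y_i(\omega))$. Furthermore, for all $\lambda \in (0,\varepsilon]$ and $n \ge 1$ we have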



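\[
\begin{aligned}
&\mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \frac{|L|_{\lambda^{-1/2},1}}{\lambda} \bigl\| \mathbb{E}_{T_m(\omega)} h_\lambda \Phi - \mathbb{E}_P h_\lambda \Phi \bigr\|_H > \varepsilon \Bigr\} \Bigr) \\
&\qquad \le \mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \frac{|L|_{\lambda^{-1/2},1}}{\lambda^2} \bigl\| \mathbb{E}_{T_m(\omega)} h_\lambda \Phi - \mathbb{E}_P h_\lambda \Phi \bigr\|_H > 1 \Bigr\} \Bigr) \\
&\qquad =: F(\lambda, n) .
\end{aligned}
\]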

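Moreover, by Theorem 4.3 we know that $h_\lambda$ is a bounded function for all $\lambda > 0$ and consequently,
Lemma 4.5 yields $\lim_{n\to\infty} F(\lambda, n) = 0$ for all $\lambda \in (0,\varepsilon]$. Now Lemma 4.6 shows that there exists a
sequence $(\lambda_n)$ with $\lambda_n \to 0$ and $F(\lambda_n, n) \to 0$. For fixed $\delta > 0$ there consequently exists an $n_0 \ge 1$
such that for all $n \ge n_0$ we have $|\mathcal{R}_{L,P}(f_{P,\lambda_n}) - \mathcal{R}^*_{L,P}| \le \varepsilon$, $\lambda_n \le \varepsilon$ and $F(\lambda_n, n) \le \delta$. For such $n$
our previous considerations then show

\[
\begin{aligned}
&\mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \bigl| \mathcal{R}_{L,P}(f_{T_m(\omega),\lambda_m}) - \mathcal{R}^*_{L,P} \bigr| > 2\varepsilon \Bigr\} \Bigr) \\
&\qquad \le \mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \frac{|L|_{\lambda_m^{-1/2},1}}{\lambda_m} \bigl\| \mathbb{E}_{T_m(\omega)} h_{\lambda_m} \Phi - \mathbb{E}_P h_{\lambda_m} \Phi \bigr\|_H > \varepsilon \Bigr\} \Bigr) \\
&\qquad \le F(\lambda_n, n) \\
&\qquad \le \delta .
\end{aligned}
\]

This shows the assertion.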
Proof of Theorem 2.21: Again, we only show the assertion in the case of $\mathcal{Z}$ satisfying the SLLN.
Obviously, we may assume without loss of generality that $\|k\|_\infty \le 1$, so that we have $\|f\|_\infty \le \|f\|_H$
for all $f \in H$. Moreover, since $|P|_p < \infty$ we may additionally assume without loss of generality
that both $|P|_p \le 1$ and $\mathcal{R}_{L,P}(0) \le 1$. Note that the latter assumption immediately yields

\[ \|f_{P,\lambda}\|_H \le \lambda^{-1/2} \]

for all $\lambda > 0$. Let $\psi : \mathbb{R} \to [0,\infty)$ be the function satisfying $L(y,t) = \psi(y - t)$, $y,t \in \mathbb{R}$. The
assumption $|P|_p < \infty$ then guarantees $\psi \in L_1(P)$ and hence the SLLN shows

\[ \lim_{n\to\infty} \mathcal{R}_{L,T_n(\omega)}(0) = \lim_{n\to\infty} \mathbb{E}_{T_n(\omega)} \psi = \mathbb{E}_P \psi = \mathcal{R}_{L,P}(0) \tag{46} \]

for $\mu$-almost all $\omega \in \Omega$. Moreover, we have $\lambda \|f_{T_n(\omega),\lambda}\|_H^2 \le \mathcal{R}_{L,T_n(\omega)}(0)$ for all $n \ge 1$, $\lambda > 0$, and
$\omega \in \Omega$, and consequently the "local Lipschitz continuity" of the $L$-risk established in [5, Lemma
25] together with Theorem 4.4 yields

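\[
\begin{aligned}
\bigl| \mathcal{R}_{L,P}(f_{T_n(\omega),\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}) \bigr|
&\le c_p \bigl( |P|_p^{p-1} + \|f_{T_n(\omega),\lambda}\|_\infty^{p-1} + \|f_{P,\lambda}\|_\infty^{p-1} + 1 \bigr) \, \| f_{T_n(\omega),\lambda} - f_{P,\lambda} \|_\infty \\
&\le \frac{c_p}{\lambda} \Bigl( 2 + \Bigl( \frac{\mathcal{R}_{L,T_n(\omega)}(0)}{\lambda} \Bigr)^{\frac{p-1}{2}} + \lambda^{-\frac{p-1}{2}} \Bigr) \bigl\| \mathbb{E}_{T_n(\omega)} h_\lambda \Phi - \mathbb{E}_P h_\lambda \Phi \bigr\|_H
\end{aligned}
\]

for all $n \ge 1$, $\lambda > 0$, and $\omega \in \Omega$. Let us fix an $\varepsilon > 0$. For $\lambda \in (0,\varepsilon]$ and $n \ge 1$ we then obtain

\[
\begin{aligned}
&\mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \bigl| \mathcal{R}_{L,P}(f_{T_m(\omega),\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}) \bigr| > \varepsilon \Bigr\} \Bigr) \\
&\qquad \le \mu\Bigl( \Bigl\{ \omega \in \Omega : \sup_{m \ge n} \frac{c_p}{\lambda^2} \Bigl( 2 + \Bigl( \frac{\mathcal{R}_{L,T_m(\omega)}(0)}{\lambda} \Bigr)^{\frac{p-1}{2}} + \lambda^{-\frac{p-1}{2}} \Bigr) \bigl\| \mathbb{E}_{T_m(\omega)} h_\lambda \Phi - \mathbb{E}_P h_\lambda \Phi \bigr\|_H > 1 \Bigr\} \Bigr) \\
&\qquad =: F(\lambda, n) .
\end{aligned}
\]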

Moreover, Theorem 4.4 ensures $h_\lambda \in L_1(P)$ for all $\lambda > 0$ and hence Lemma 4.5 together with (46)
shows $\lim_{n\to\infty} F(\lambda, n) = 0$ for all $\lambda \in (0,\varepsilon]$. Now the rest of the proof is analogous to the proof of
Theorem 2.20. ■

4.5 Proofs from Subsection 3.1

Proof of Proposition \3.^ ii) =^ i). Follows from Theorem 12. 4[ 

i) $\Rightarrow$ ii). Let $P$ be the stationary mean of $(\mathcal{Z}, \mu)$. As in the proof of Proposition 2.7 we then find
an $n_0 \ge 1$ such that for all $n \ge n_0$ we have



I 1 " \ " 

{c^Gf): \-Y^lBoZi{uj)-P{B)\ >e}j < 4e-2n-%(^(lp o - E^lp o 



Let us write hi := 1b o Zi — E^lp o Zj, i > 1. Then we have E^/ij = and hi(uj) G [—1, 1] for all 
i > 1 and all u> Q. Consequently, (fM|) gives R^{Z, j) < 2TTa{Z, ^,i,j), i,j > 1, and hence 
we obtain 

n ^ n n i—1 n i—1 

{^{^B o Zi - E^Ib oZi)^ =E^^hj+ 2E^ Yl Yl ^^^1 ^ ^ + 4^ «(^' ^' 3) ■ 

i=l i=l i=l j=l i=l j=l 

Combining the estimates then yields the assertion. ■ 
4.6 Proofs from Subsection 3.2

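Proof of Theorem\3lE; Let $\mathcal{B}$ be the $\sigma$-algebra of $Z$. We write $P_n(B) := \frac{1}{n} \sum_{i=1}^{n} \mu(Z_i \in B)$
for $B \in \mathcal{B}$ and $n \ge 1$. Then $P_n$ is obviously a probability measure on $\mathcal{B}$ for all $n \ge 1$. Let us first
show that

\[ \lim_{n\to\infty} \mathcal{R}_{L,P}(f_{P_n,\lambda_n}) = \mathcal{R}^*_{L,P} . \tag{47} \]

To this end we first observe that the assumption (28) yields

\[
\begin{aligned}
\mathcal{R}_{L,P}(f_{P_n,\lambda_n})
&\le \lambda_n \|f_{P_n,\lambda_n}\|_H^2 + \mathcal{R}_{L,P_n}(f_{P_n,\lambda_n}) + C \|L \circ f_{P_n,\lambda_n}\|_\infty n^{-\alpha} \\
&\le \lambda_n \|f_{P,\lambda_n}\|_H^2 + \mathcal{R}_{L,P_n}(f_{P,\lambda_n}) + C \|L \circ f_{P_n,\lambda_n}\|_\infty n^{-\alpha} \\
&\le \lambda_n \|f_{P,\lambda_n}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda_n}) + C n^{-\alpha} \bigl( \|L \circ f_{P,\lambda_n}\|_\infty + \|L \circ f_{P_n,\lambda_n}\|_\infty \bigr)
\end{aligned}
\tag{48}
\]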

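for all $n \ge 1$. Now $\mathcal{R}^*_{L,P,H} = \mathcal{R}^*_{L,P}$ together with $\lambda_n \to 0$ yields $\lambda_n \|f_{P,\lambda_n}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda_n}) \to \mathcal{R}^*_{L,P}$.
Moreover, for every distribution $Q$ on $Z$ we have

\[ \|L \circ f_{Q,\lambda}\|_\infty \le C + |L|_{B_\lambda,1} \|f_{Q,\lambda}\|_\infty \le C + |L|_{B_\lambda,1} B_\lambda \]

by p6|) and Theorem 4.2. In addition, $(|L|_{B_{\lambda_n},1})$ is a non-decreasing sequence and the sequence
$(B_{\lambda_n})$ is dominated by the sequence $(\lambda_n^{-1/2})$. Consequently, (30) implies $n^{-\alpha} |L|_{B_{\lambda_n},1} B_{\lambda_n} \to 0$ and
hence we find (47). Let us now fix an $\varepsilon > 0$. Then Theorem 4.3 and Markov's inequality yield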

^(^[iven-. |7^L,p(/T„H,AJ - ^l,p(/p„,aJ| > e}) 
< G : \L\b^^,i ||/t„h,a„ - /p„,A„||oo > e}^ 



^ 2TT^K-~MlFT„H^n^-Ep„/ln^||^ 

where $h_n$ is the function according to Theorem 4.3 for the distribution $P_n$ and the regularization
parameter $\lambda_n$. Let us define

\[ g_{n,i} := (h_n \Phi) \circ (X_i, Y_i) - \mathbb{E}_\mu (h_n \Phi) \circ (X_i, Y_i) \]

for $n \ge 1$ and $i = 1,\dots,n$. Then we have $\mathbb{E}_\mu g_{n,i} = 0$ and Theorem 4.3 yields

\[ \|g_{n,i}\|_\infty \le 2 \sup_{\omega \in \Omega} \bigl\| (h_n \Phi) \circ (X_i, Y_i)(\omega) \bigr\|_H \le 2 \|h_n\|_\infty \|k\|_\infty \le 2 \|k\|_\infty |L|_{B_{\lambda_n},1} . \]
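Consequently, (24) and (23) show that there exists a universal constant $c \ge 1$ such that

\[
\begin{aligned}
\mathbb{E}_\mu \Bigl\| \frac{1}{n} \sum_{i=1}^{n} g_{n,i} \Bigr\|_H^2
&= n^{-2} \sum_{i=1}^{n} \mathbb{E}_\mu \|g_{n,i}\|_H^2 + 2 n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \mathbb{E}_\mu \langle g_{n,i}, g_{n,j} \rangle \\
&\le 4 n^{-1} \|k\|_\infty^2 |L|_{B_{\lambda_n},1}^2 + c \|k\|_\infty^2 |L|_{B_{\lambda_n},1}^2 \, n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \alpha(\mathcal{Z}, \mu, i, j)
\end{aligned}
\]

for all $n \ge 1$. By combining all estimates and using (30) we then obtain the assertion.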



Proof of Theorem 3.4' Without loss of generality we assume ||A;||oo < 1 and \fJ'(Xi,Yi)\'i — ^ 
all i > 1. In addition, we can obviously, also assume A„ S (0,1] for all n > 1. Now, we define 



n Z-^i- 

calculation then shows 



PniB) := nY17=i f^i^i ^ ^) ^'^^ measurable B C X x M and n > 1. For r G a simple 



/" 1 v-^ f 1 \— ^ 

\Pn\l= |yrdP„(x,y) = - V / |yrd/i(x,.y,)(x,y) = - < 1- (49) 

JXxM ''T'~lJXxR ' ^ 

Moreover, |44l Thm. 23.8] together with Fatou's lemma yields 

\P\; = / P({(x,y) G X X M : > = / lim - V Ai({a; G : |yi(w)r > 

< liminf / fi({uj en ■.\Y,{lo)\^ >t})dt 

•'^ i=l 
1 " 

< liminf- |//(x„y,)lr 



< 1. 



Having finished these preparations we can now begin with the actual proof. To this end first observe
that we obtain

\[ \mathcal{R}_{L,P}(f_{P_n,\lambda_n}) \le \lambda_n \|f_{P,\lambda_n}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda_n}) + C n^{-\alpha} \bigl( \|L \circ f_{P,\lambda_n}\|_{L_1(P)} + \|L \circ f_{P_n,\lambda_n}\|_{L_1(P)} \bigr) \]

as in (48). Moreover, we obviously have $\|L \circ f_{P,\lambda_n}\|_{L_1(P)} = \mathcal{R}_{L,P}(f_{P,\lambda_n}) \le \mathcal{R}_{L,P}(0) \le c$ for some
constant $c$ independent of $n$. In addition, (49) yields

II-^^°/p„,aJ|li{p) = / il^iy - fp„,x„{x))dP{x,y) 

JXxY 

< cvl l + |yr + |/p„,A„(x)rdP(x,y) 

JXxY 



< 2Cp + Cp||/p„,Aj|P 

I 

"a, 



n ,An 1 1 OO 

E 



< 2cj, + c„A^ 2 



where Cp and Cp are constants only depending on L and p. Combining these estimates with 
limA^o^L,p(/p,A) = T^*L,P,H = '^L,p and ^ we then obtain lim„_oo ■^l,p(/p„,a„) = ^l,p- 
Now let us assume that we have an a; G and an n > 1 with ||/T„(a;),A„ ~ /Pn,A„ \\h < 1- For p > 1 
a simple calculation using [H Lemma 25] and A„ < 1 then shows 

|'^L,P(/P„,A„) - ■^L,p(/T„(a;),A„)| 
- + ll/Pn,A„ll?^^ + \\fT„(Lu),\J\^^ + ^) ll/i'n.An " /t„ (o;) , A„ 1 1 OO 

< Cp(2 + 2\\fp„^xJ\^^ + ||/t„h,a„ - /p„,A„||S7^) ll/p„,A„ - /t„h,a„I|// 

< Cp ^3 + 2^ ^^"^ ^ ll/p„,A„ - /t„h,a„I|j/ 

< CpXn ll/Pn,A„ - /T„(a;),A„l|H 

P+i 

< CpXn ^ ||Er^(^)/l„$ -Ep„/l„$||^, 

where Cp > 1 and (7p > 1 are constants only depending on p and L, and /i„ is the function according 
to Theorem 14.31 for the distribution P„ and the regularization parameter A^. Moreover, for p = 1 
we see that L is Lipschitz continuous by ^ Lemma 4] and hence the above estimate is also true in 
this case. Let us now define 

gn,i := {K^) o {Xi,Yi) - E^(/i„$) o {Xi, Yi) 

for n > 1 and i = 1, . . . , n. Then we have E^^i^^j = and for s := we find 

\\9n,i\\L,{^,) < 2||/i„||i4^(^^_y^j) ^ 128cL(l + |^(x„y,)ir^ + ll/p. 



'n-^-^n Woo } 



< 128cL 2 + 



Aji 



2 



where Cl,p > is a constant only depending on L and p. For 5 > Markov's inequality together 
with $s > 2$, (24) and (23) thus yields



^ / 71 71 i—l \ 

n,i-, 9n,j) I 

^i=l i=l j=l ^ 

^ ^ 71 n i—1 s 

- 72^2 ( ll5n,i|lL(;.) + 2 X] ^'.^Olbn,* IIl,(/i) lbn,il|L4/i) ) 

^i=l i=l 7 = 1 ^ 



(52ArV/5 



where C^^p > is another constant only depending on L and p. Let us now fix an e G (0, 1]. For 
U! £ and n > 1 with 

Op 

. {p-l)/2 

we then have ||/t„(lj),a„ ~ /p„,A„||i/ < ^-''g; < 1, and consequently we can conclude 

^i(^[u;en■. |7^L,p(/p„,AJ - 7^L,p(/T„H,AJ| < e}) 

> G 17 : ||Er„(^)/i„$ - Ep„/i„$||^ < IJ 



> 



e2A^''n/3 



Using (pI5|) then yields the assertion. ■ 